What have the authors stated as future work in "On the interplay of parallelization, program performance, and energy consumption"?

In this paper, the authors developed an analytical framework to study the trade-offs between parallelization, program performance, and energy consumption. The authors considered two machine models: one assumes that individual processors cannot be turned off independently, and the other assumes that they can. When processors can be individually turned off, the analysis indicates that the minimum total energy is independent of the number of processors used for executing the parallel section, while the energy-delay product is minimized when the maximum number of available processors is used during the parallel execution section. The demonstrated substantial power advantage that can be gained from turning off individual processors is a great incentive to design multicore processors with the capability of turning off individual processors.

(Open Access) On the Interplay of Parallelization, Program Performance, and Energy Consumption (2010) | Sangyeun Cho

On the Interplay of Parallelization, Program Performance, and Energy Consumption

Sangyeun Cho, Member, IEEE, and Rami G. Melhem, Fellow, IEEE

Abstract—This paper derives simple, yet fundamental formulas to describe the interplay between parallelism of an application,

program performance, and energy consumption. Given the ratio of serial and parallel portions in an application and the number of

processors, we derive optimal frequencies allocated to the serial and parallel regions in an application to either minimize the total

energy consumption or minimize the energy-delay product. The impact of static power is revealed by considering the ratio between

static and dynamic power and quantifying the advantages of adding to the architecture capability to turn off individual processors and

save static energy. We further determine the conditions under which one can obtain both energy and speed improvement, as well as

the amount of improvement. While the formulas we obtain use simplifying assumptions, they provide valuable theoretical insights into

energy-aware processor resource management. Our results form a basis for several interesting research directions in the area of

energy-aware multicore processor architectures.

Index Terms—Multicore processor, Amdahl’s law, dynamic voltage and frequency scaling (DVFS), energy-delay product (EDP).

1 INTRODUCTION

A surge of attention is being paid to parallel processing

with the recent emergence of commodity multicore

processors. Microprocessors carrying two to eight general

processing cores are commercially available [2], [15], [21],

[31], [32] and projections suggest that future technologies

will allow integrating many more cores, potentially in the

order of hundreds to thousands, in a single chip [4], [22].

Abundant on-chip processing elements and much reduced

processor-to-processor communication overhead will offer

an unprecedented environment for efficient and affordable

parallel processing on every desktop.

While the increased amount of on-chip computing

resources promises higher performance through parallel

execution of applications, suppressing the power and

energy consumption remains an even more stringent

constraint to the design and management of such processors [5], [26]. The allowed power level has been and will be

limited to the same constant value (150W) and the chip

operating voltage will not be significantly lowered in the

future, according to the ITRS projection [19]. Battery

capacity and efficiency are not improving as quickly as

the number of transistors in a chip increases. Henceforth,

many previously developed low-power and low-energy

ideas will play an even more significant role in future

multicore processors, including highly beneficial dynamic

voltage and frequency scaling (DVFS or simply DVS)

techniques [11], [35], [36] and the capability of turning off

individual processor cores [20].

This paper presents a theoretical study on the interplay

of parallelization, program performance, and energy consumption. It has been previously pointed out that parallel

processing can be used to lower energy instead of

improving performance. For instance, Borkar [5] suggested

that a perfect two-way parallelization would lead to half the

clock frequency (and voltage), one-quarter of the energy

consumption, and one-eighth of the power density compared with the sequential execution given the same

execution time constraint. Little work has been done,

however, to understand how parallelizing an application

would enable us to achieve both speedup and energy

improvement or how to obtain the best energy improvement given an application and a processor architecture. We

will address the following specific questions:

1. What is the maximum energy improvement due to

parallelization, and how can we determine the

processor speeds to achieve that improvement?

2. How does static power affect the energy-optimal

program speedup and energy consumption?

3. Given a target speedup, how do we set the processor

speeds to minimize energy?

4. What is the condition for obtaining the minimum

energy-delay product?

Our goal in this work is to follow a simple analytical

approach, similar to the one used in Amdahl’s law [3], to

derive formulas that describe the behavior of the system.

For example, according to Amdahl’s law, the maximum

program speedup due to parallelization is

$$\text{Speedup} = \frac{1}{s + \frac{p}{N}}, \qquad (1)$$

where $(s + p) = 1$, $s$ ($p$) is the ratio of the serial (parallel) portion in the program, and $N$ is the number of processors.

Using the same input parameters, we can calculate how

parallelization improves energy consumption:

342 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 21, NO. 3, MARCH 2010

. S. Cho and R.G. Melhem are with the Department of Computer Science,

University of Pittsburgh, 5407 Sennott Square, 210 S. Bouquet Street,

Pittsburgh, PA 15260. E-mail: {cho, melhem}@cs.pitt.edu.

Manuscript received 21 Oct. 2008; revised 23 Jan. 2009; accepted 19 Feb.

2009; published online 27 Feb. 2009.

Recommended for acceptance by D. Bader.

For information on obtaining reprints of this article, please send e-mail to:

tpds@computer.org, and reference IEEECS Log Number TPDS-2008-10-0427.

Digital Object Identifier no. 10.1109/TPDS.2009.41.

1045-9219/10/$26.00 © 2010 IEEE Published by the IEEE Computer Society

$$\text{Improvement in Dynamic Energy} = \frac{1}{\left(s + \frac{p}{N^{(\alpha - 1)/\alpha}}\right)^{\alpha}}, \qquad (2)$$

when the parallel program execution time is identical to the sequential execution time and the dynamic power consumption of a processor running at frequency $f$ is proportional to $f^{\alpha}$. The above equation, whose detailed derivation will be shown in Section 3, illustrates that more parallelism (larger $p$ and smaller $s$) and more processors (larger $N$) help reduce energy. Fig. 1 presents a plot of (2). For a typical value of $\alpha$, the energy improvement function in (2) grows faster than the speedup function in (1) with $p$ or $N$.
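To make the comparison between (1) and (2) concrete, the following is a minimal Python sketch. The function names and the default $\alpha = 3$ are illustrative assumptions, not part of the paper; the formulas themselves are (1) and (2) with $p = 1 - s$.

```python
def amdahl_speedup(s: float, n: int) -> float:
    """Maximum speedup from Eq. (1), with p = 1 - s."""
    return 1.0 / (s + (1.0 - s) / n)


def dynamic_energy_improvement(s: float, n: int, alpha: float = 3.0) -> float:
    """Dynamic energy improvement from Eq. (2) at unchanged execution time.

    alpha is the exponent of the dynamic power model (power ~ f**alpha);
    the default of 3 is an assumed typical value.
    """
    return 1.0 / (s + (1.0 - s) / n ** ((alpha - 1.0) / alpha)) ** alpha


if __name__ == "__main__":
    # Compare both functions for a few serial fractions on 4 processors.
    for s in (0.5, 0.25, 0.0):
        print(s, amdahl_speedup(s, 4), dynamic_energy_improvement(s, 4))
```

With $s = 0$, $N = 2$, and $\alpha = 3$, the energy improvement evaluates to about 4, which agrees with the one-quarter energy figure quoted from Borkar [5] for a perfect two-way parallelization.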

The results obtained in this paper reveal interesting relationships between processor speeds in the sequential and parallel regions of a program and between dynamic and static energy consumption. For instance, when individual processors cannot be turned off, the total energy is minimized when the dynamic energy is equal to $1/(\alpha - 1)$ times the static energy regardless of the value of $N$ or $s$. Moreover, we find that minimum energy is achieved at processor speeds smaller than the maximum speeds only when $\lambda < (\alpha - 1)/N$, where $\lambda$ is the ratio of the static power consumption to the dynamic power consumption at the maximum processor speed. Under this condition, the ratio between the processor speed of the serial and parallel section to minimize energy is $N^{1/\alpha}$. When the condition does not hold, the program’s serial section must be executed at full speed. If $\lambda > (\alpha - 1)$, processor speeds in both the serial and the parallel sections must be set to the maximum speed to achieve the minimum total energy. The above conditions are greatly relaxed if we can turn off individual processors to save static energy consumption.

The rest of this paper is organized as follows: Section 2

presents our problem formulation and the two machine

models that we will consider in the paper. Sections 3 and 4

study the problem of minimizing energy consumption for

the two machine models, followed by a discussion on

energy-delay product in Section 5. Section 6 further

discusses how our derivations can be used in situations

when there are constant-speed operations (such as memory

access) whose latencies do not depend on the processor

speed. Sections 3, 4, and 5 assume that the processor speeds

simply determine the execution time of a program. Related

work will be discussed and contrasted with our work in

Section 7 and conclusions will be summarized in Section 8.

2 PROBLEM FORMULATION AND MACHINE MODELS

2.1 Problem Formulation and Assumptions

We have an application model identical to that of Amdahl’s

law. An application has a serial section that can be executed

by a single processor and a parallel section that can be

executed by any number of processors in the system, i.e., fully

parallelizable. When the number of processors employed is

N, the speedup of the parallel section is N. We do not consider

the overhead of processor-to-processor communications.

We normalize the sequential execution time of the

program to be 1, in order to present our derivation in an

intuitive way. Similarly, we normalize the amount of work

(i.e., number of cycles) in the program to be 1. Therefore, the

maximum clock frequency, $F_{\max}$, has a relative speed of 1 and the program has the serial portion whose amount of work is represented with $s$, and the parallel portion with $p$ (or $1 - s$). Fig. 2 shows this arrangement. The program speedup is denoted by $x$ and the resulting program execution time with $y = 1/x$. With $t$ denoting the time spent executing the serial portion, the clock frequencies for the two regions in the work, namely, $s$ and $p$, are calculated as follows:

$$f_s = \frac{s}{t}, \qquad (3)$$

$$f_p = \frac{1 - s}{(y - t)N}. \qquad (4)$$

We assume a DVFS scheme where voltage and frequency

are changed linearly. To be general, we also assume that the

power consumption of each processor consists of two

components, a frequency-dependent component that can be

controlled by changing the frequency of the processor

(DVFS), and a frequency-independent component that is

not controlled by DVFS. We call these two components “dynamic” and “static,” respectively. There have been many studies to either theoretically or empirically model the dependence of the power consumption on the operating frequency, and most of these studies conclude that this dependence can be approximated by $C \cdot f^{\alpha}$, for some value of $\alpha \ge 2$ [30], [35]. The frequency-independent component

is a constant that depends on the technology and system

CHO AND MELHEM: ON THE INTERPLAY OF PARALLELIZATION, PROGRAM PERFORMANCE, AND ENERGY CONSUMPTION 343

1. In the literature, $\alpha$ is between 2 and 3, typically 3 [30], [35].

Fig. 1. Achievable dynamic energy improvement assuming $\alpha = 3$ and using 1, 2, 3, and 4 processors given the parallel portion’s ratio of a program.

Fig. 2. Normalized “work” and “time.” “Parallel time” is partitioned into

serial and parallel regions.

architecture. However, assuming that $E_{\max} = C \cdot F_{\max}^{\alpha}$ is the energy of the frequency-dependent component at the maximum processor speed, the energy for the frequency-independent component can be normalized with regard to $E_{\max}$ and expressed as $\lambda \cdot E_{\max}$. In this paper, we further normalize $E_{\max}$ to 1 so that the static power consumption is simply $\lambda$. Although the power model described above is

approximate, its simplicity allows us to compare the effect

of parallelism on the choice of processor speed policies

using closed-form formulas. It is common to use simplified

power models to derive DVFS policies [33].

Clearly, the benefit of any DVFS technique largely

depends on the value of $\lambda$. Specifically, for systems with relatively large $\lambda$, the frequency-independent component of power dominates the total energy consumption, and thus, applying DVFS techniques will not produce any appreciable energy savings. However, in systems with relatively small $\lambda$, applying DVFS techniques can reduce the total

system energy consumption. Many DVFS algorithms have

been proposed recently and demonstrated to reduce power

consumption [11], [18], [25], [29], [35], [36], and many

commercial microprocessors are now equipped with DVFS

capabilities [8], [9], [16], [27].

For a given problem, $s$ is fixed, and for a given architecture, $N$, $\alpha$, and $\lambda$ are fixed. Hence, the energy consumption, $E$, is a function of $t$ and $y$. Specifically,

$$E(t, y) = t \cdot f_s^{\alpha} + N \cdot (y - t) \cdot f_p^{\alpha} + N \cdot \lambda \cdot y. \qquad (5)$$

In (5), the three terms represent energy for the serial

portion, energy for the parallel portion, and energy for the

static power consumption during the whole execution

time, respectively. We do not consider the processor

temperature, and hence, the term for the static energy

is the product of per-processor power consumption rate, $\lambda$,

the number of processors, N, and the total execution time,

y. We do not assume a specific interprocessor network

topology and do not consider energy consumption of the

interprocessor network.
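The energy model in (5) can be sketched in a few lines of Python. This is only an illustration under the paper's normalization (sequential time 1, total work 1, $F_{\max} = 1$); the function name `total_energy` and the default parameter values are assumptions made for the example.

```python
def total_energy(t: float, y: float, s: float, n: int,
                 alpha: float = 3.0, lam: float = 0.1) -> float:
    """Total energy E(t, y) from Eq. (5) when all n processors stay powered.

    t   : time spent in the serial region (0 < t < y)
    y   : total execution time (y = 1/x for speedup x)
    s   : serial fraction of the work
    lam : static power relative to dynamic power at full speed
    """
    f_s = s / t                        # Eq. (3): serial-region frequency
    f_p = (1.0 - s) / ((y - t) * n)    # Eq. (4): parallel-region frequency
    serial_dynamic = t * f_s ** alpha
    parallel_dynamic = n * (y - t) * f_p ** alpha
    static = n * lam * y
    return serial_dynamic + parallel_dynamic + static


if __name__ == "__main__":
    # One processor at full speed: dynamic energy 1 plus static energy lam.
    print(total_energy(t=0.5, y=1.0, s=0.5, n=1, lam=0.2))
```

For a single processor running at full speed the function returns $1 + \lambda$, the energy of the sequential baseline.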

In our problem formulation, we assumed that the

processor speed (or processor’s clock frequency) solely

determines the runtime of a program. However, certain

“constant-speed” operations, such as memory access

(caused by a cache miss in cache-based systems) and I/O

processing, may take a fixed amount of time that is

independent of the processor speeds. Consequently, increasing or decreasing the processor speed will have a “sublinear” effect (rather than “linear”) on performance

[14]. For clear presentation and intuitive discussion, we will

not consider the impact of the constant-speed operations on

program execution time (and hence, energy consumption)

in the following three sections. However, Section 6 will

specifically discuss how constant-speed operations affect

our derivations and intuitions learned.

Finally, it is important to note that this work pays little

attention to practical issues of parallelizing an application

or mapping serial and parallel regions of an application to

multiple cores. We assume (as Amdahl’s law suggests) that

an application is perfectly parallelized given its parallelism

and work described by a parallel code region can be

perfectly distributed to an arbitrary number of processors.

2.2 Two Machine Models

In the problem formulation in (5), we assumed that

processors consume static energy in both the serial and

the parallel regions. That is, even when a processor is not

assigned a task to execute (and thus, sits idle), it consumes

static energy. Naturally, one would regard an idle processor

as an opportunity to save energy consumption if processors

can be turned off when not busy, given a mechanism to turn

off individual processors. Because low-energy consumption

via suppressing static power will be increasingly important,

viable mechanisms such as separate power supply designs

are actively explored in the context of multicore processors

in industry [20]. Hence, we will study in this work two

machine models: one without and one with the capability to

turn off individual processors. Throughout this paper, we

refer to these two machine models M

and M

For the simplicity of our derivation, we assume that M

can turn off or on processors without any overhead. Given

the same processor speed setting, the dynamic energy

consumption of M

is the same as that of M

: sum of the

first two terms in (5). Only the static energy part need be replaced by $t \cdot \lambda + N \cdot \lambda \cdot (y - t)$.
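The difference between the two static energy terms can be illustrated with a short Python sketch. The function names are hypothetical; the first function is the third term of (5), and the second is the replacement term for the machine model that can turn off individual processors, assuming (as the text does) no overhead for switching processors on or off.

```python
def static_energy_always_on(t: float, y: float, n: int, lam: float) -> float:
    """Static energy when all n processors stay powered for the whole run."""
    return n * lam * y


def static_energy_with_turn_off(t: float, y: float, n: int, lam: float) -> float:
    """Static energy when idle processors are off during the serial region.

    Only one processor is powered for the serial time t; all n processors
    are powered for the parallel time (y - t).
    """
    return t * lam + n * lam * (y - t)


if __name__ == "__main__":
    args = dict(t=0.5, y=1.0, n=4, lam=0.1)
    print(static_energy_always_on(**args), static_energy_with_turn_off(**args))
```

The saving is $(N - 1) \cdot \lambda \cdot t$, so it grows with both the number of processors and the time spent in the serial region.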

We further assume that processors can run at an

arbitrary clock frequency, subject to the maximum frequency $F_{\max}$ in both machine models. While processor clock

frequencies in real chip implementations are typically

discrete, previous work showed that energy savings using

discrete frequencies closely match that of continuous

frequencies [18]. The speedup x one would achieve with

parallelization and processor speed scaling is subject to

$$x \le \frac{1}{s + \frac{1 - s}{N}} \qquad (6)$$

according to Amdahl’s law in (1). It is clear that this condition is equivalent to $f \le F_{\max} = 1$, and $y \ge s + \frac{1 - s}{N}$.

We will discuss the two machine models in detail in the

following two sections.

3 MACHINE MODEL M: PROCESSORS CANNOT BE TURNED OFF INDIVIDUALLY

3.1 The Case of $x = 1$

We first obtain the minimum energy consumption when $x = 1$ (or $y = 1$), i.e., program execution time is unchanged. Imposing the condition $x = 1$ is similar to setting a deadline, which is the original sequential execution time, to finish the computation. While one can further reduce the dynamic energy by slowing down processors, considering the case of $x = 1$ provides us with interesting insights as well as a basis for later discussions. To minimize the total energy, we rewrite (5) as

$$E(t) = t \left(\frac{s}{t}\right)^{\alpha} + N(1 - t)\left(\frac{1 - s}{(1 - t)N}\right)^{\alpha} + N\lambda. \qquad (7)$$

Next, we obtain the derivative of $E(t)$ with respect to $t$,

$$\frac{dE(t)}{dt} = -\frac{(\alpha - 1)s^{\alpha}}{t^{\alpha}} + \frac{(\alpha - 1)(1 - s)^{\alpha}}{(1 - t)^{\alpha} N^{(\alpha - 1)}}, \qquad (8)$$


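and then, we compute the value of $t$ which minimizes $E(t)$ by setting $dE(t)/dt$ to 0 and obtain

$$\frac{1 - t}{t} = \frac{1 - s}{s \cdot N^{(\alpha - 1)/\alpha}}. \qquad (9)$$

Hence, the value of $t$ which minimizes $E(t)$ is

$$t^{*} = \frac{s}{s + \frac{1 - s}{N^{(\alpha - 1)/\alpha}}}. \qquad (10)$$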

We are ready to obtain the values of $f_s$ and $f_p$, which minimize $E(t)$ using (3), (4), and (10). Specifically,

$$f_s^{*} = s + \frac{1 - s}{N^{(\alpha - 1)/\alpha}}, \qquad (11)$$

$$f_p^{*} = \frac{s + \frac{1 - s}{N^{(\alpha - 1)/\alpha}}}{N^{1/\alpha}}, \qquad (12)$$

$$f_p^{*} = \frac{f_s^{*}}{N^{1/\alpha}}. \qquad (13)$$

Both $f_s^{*}$ and $f_p^{*}$ are a function of $s$ and $N$ in (11) and (12), and (13) shows the relationship between $f_s$ and $f_p$ when $E(t)$ is minimized. Interestingly, the ratio between the two frequencies, $f_s^{*}/f_p^{*}$, is a function of $N$, but not $s$.
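As a quick illustration of (10) through (13), the sketch below computes the energy-optimal serial time and the two frequencies for the case $x = 1$. The function name `optimal_split_x1` and the default $\alpha = 3$ are assumptions for the example only.

```python
def optimal_split_x1(s: float, n: int, alpha: float = 3.0):
    """Return (t_star, f_s_star, f_p_star) for unchanged execution time (x = 1).

    Assumes 0 < s < 1 so that both regions are non-empty.
    """
    k = s + (1.0 - s) / n ** ((alpha - 1.0) / alpha)
    t_star = s / k                       # Eq. (10)
    f_s_star = k                         # Eq. (11)
    f_p_star = k / n ** (1.0 / alpha)    # Eq. (12), equivalently Eq. (13)
    return t_star, f_s_star, f_p_star


if __name__ == "__main__":
    t_star, f_s, f_p = optimal_split_x1(s=0.5, n=4)
    # The ratio f_s / f_p depends only on n and alpha, not on s.
    print(t_star, f_s, f_p, f_s / f_p)
```

Calling the function with different values of `s` and the same `n` leaves the printed ratio unchanged, which is the observation made about (13).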

Finally, from (7) and (10), we obtain the minimum energy consumption

$$E_{\min} = E(t^{*}) = \left(s + \frac{1 - s}{N^{(\alpha - 1)/\alpha}}\right)^{\alpha} + N \cdot \lambda. \qquad (14)$$

Fig. 1 depicts the maximum energy improvement due to parallelization ($E_{\min}^{-1}$) when the number of processors is varied between one and four, assuming $\alpha = 3$ and $\lambda = 0$. It is clear that energy improvement is a function monotonically increasing with $p$ and $N$. The curves are also higher than those given by Amdahl’s law (not shown). Fig. 3 shows how the overall energy ($E(t)$) changes as we adjust $t$. It also presents $t^{*}$, the value of $t$ that minimizes $E(t)$. Note that the optimal solution obtained for $f_s^{*}$ and $f_p^{*}$ is feasible since both the frequencies are smaller than the maximum frequency $F_{\max} = 1$.
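The closed form in (14) can be checked numerically against a brute-force search over $t$ in (7). The following Python sketch does this on a uniform grid; the function names, the grid resolution, and the parameter values are assumptions chosen for illustration.

```python
def energy_x1(t: float, s: float, n: int,
              alpha: float = 3.0, lam: float = 0.0) -> float:
    """E(t) from Eq. (7): total energy when execution time is fixed at 1."""
    serial = t * (s / t) ** alpha
    parallel = n * (1.0 - t) * ((1.0 - s) / ((1.0 - t) * n)) ** alpha
    return serial + parallel + n * lam


def e_min_closed_form(s: float, n: int,
                      alpha: float = 3.0, lam: float = 0.0) -> float:
    """Minimum energy from Eq. (14)."""
    return (s + (1.0 - s) / n ** ((alpha - 1.0) / alpha)) ** alpha + n * lam


def e_min_grid(s: float, n: int, alpha: float = 3.0,
               lam: float = 0.0, steps: int = 100_000) -> float:
    """Smallest E(t) over the grid t = 1/steps, 2/steps, ..., (steps-1)/steps."""
    return min(energy_x1(i / steps, s, n, alpha, lam) for i in range(1, steps))


if __name__ == "__main__":
    print(e_min_closed_form(0.25, 4), e_min_grid(0.25, 4))
```

The grid minimum can only approach the closed-form value from above, so the two printed numbers should agree to several decimal places.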

3.2 The Case of Unrestricted x

Amdahl’s law explores the effect of parallelization on

speedup, and we have described the effect of parallelization

on energy consumption when the program execution time

is unchanged, i.e., $x = 1$. We note that the fixed value of $x$, in turn, fixes the static energy consumption. We have thus focused only on the dynamic energy consumption. Hence, the optimal processor speeds in (11) and (12) are independent of $\lambda$. In this section, we relax the program execution

time constraint and revisit the same problem of minimizing

the total energy consumption. Unrestricting x will expose

the impact of static power consumption on the optimal

speed of a program’s serial and parallel sections.

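We begin by setting the derivatives of (5) with respect to both $t$ and $y$ to zero as follows:

$$-\frac{(\alpha - 1)s^{\alpha}}{t^{\alpha}} + \frac{(\alpha - 1)(1 - s)^{\alpha}}{(y - t)^{\alpha} N^{(\alpha - 1)}} = 0 \;\Longrightarrow\; \frac{y - t}{t} = \frac{1 - s}{s \cdot N^{(\alpha - 1)/\alpha}}, \qquad (15)$$

$$-\frac{(\alpha - 1)(1 - s)^{\alpha}}{(y - t)^{\alpha} N^{(\alpha - 1)}} + N\lambda = 0 \;\Longrightarrow\; (y - t) = \left(\frac{\alpha - 1}{\lambda}\right)^{1/\alpha} \frac{1 - s}{N}. \qquad (16)$$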

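Solving (15) and (16) for $t$ and $y$ gives

$$t^{*} = \left(\frac{\alpha - 1}{N\lambda}\right)^{1/\alpha} \cdot s, \qquad (17)$$

$$y^{*} = \left(\frac{\alpha - 1}{N\lambda}\right)^{1/\alpha} \cdot \left(s + \frac{1 - s}{N^{(\alpha - 1)/\alpha}}\right). \qquad (18)$$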

With $t^{*}$ and $y^{*}$, we can use (3) and (4) to calculate the optimum frequencies

$$f_s^{*} = \left(\frac{N\lambda}{\alpha - 1}\right)^{1/\alpha}, \qquad (19)$$

$$f_p^{*} = \left(\frac{\lambda}{\alpha - 1}\right)^{1/\alpha} = \frac{f_s^{*}}{N^{1/\alpha}}, \qquad (20)$$

from which we can compute the minimum energy. Again, note that $f_s^{*}$ and $f_p^{*}$ are independent of $s$ and that the ratio between them is $N^{1/\alpha}$. An interesting observation is that at $f_s^{*}$ and $f_p^{*}$, the dynamic energy is given by

$$E_{\text{dynamic}} = t^{*} \cdot (f_s^{*})^{\alpha} + N \cdot (y^{*} - t^{*}) \cdot (f_p^{*})^{\alpha} = \frac{1}{\alpha - 1} \cdot N \cdot \lambda \cdot y^{*}, \qquad (21)$$

which is equal to $1/(\alpha - 1)$ of the static energy, $E_{\text{static}} = N \cdot \lambda \cdot y^{*}$. In other words, the total energy consumption is minimized when the dynamic energy consumption is $1/(\alpha - 1)$ times the static energy. This relation holds during the execution of both the serial and the parallel sections of the program.
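The relation in (21) is easy to confirm numerically. The sketch below evaluates (17) through (20) and returns the dynamic and static energy at the optimum; the function name, the dictionary keys, and the example parameters are illustrative assumptions. It rejects inputs for which the computed serial frequency would exceed the normalized maximum $F_{\max} = 1$.

```python
def unrestricted_optimum(s: float, n: int,
                         alpha: float = 3.0, lam: float = 0.1) -> dict:
    """Energy-optimal operating point when the speedup x is unrestricted."""
    f_s = (n * lam / (alpha - 1.0)) ** (1.0 / alpha)    # Eq. (19)
    f_p = (lam / (alpha - 1.0)) ** (1.0 / alpha)        # Eq. (20)
    if f_s > 1.0:
        raise ValueError("f_s exceeds F_max = 1; Eqs. (17)-(20) do not apply")
    scale = ((alpha - 1.0) / (n * lam)) ** (1.0 / alpha)
    t = scale * s                                                   # Eq. (17)
    y = scale * (s + (1.0 - s) / n ** ((alpha - 1.0) / alpha))      # Eq. (18)
    dynamic = t * f_s ** alpha + n * (y - t) * f_p ** alpha
    static = n * lam * y
    return {"t": t, "y": y, "f_s": f_s, "f_p": f_p,
            "dynamic": dynamic, "static": static}


if __name__ == "__main__":
    point = unrestricted_optimum(s=0.25, n=4, alpha=3.0, lam=0.1)
    # Eq. (21): dynamic energy equals static energy divided by (alpha - 1).
    print(point["dynamic"], point["static"] / (3.0 - 1.0))
```

The two printed values should match up to floating-point rounding.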

The above solution is only applicable if both $f_s^{*}$ and $f_p^{*}$ are smaller than $F_{\max}$, however, necessitating that $\lambda \le (\alpha - 1)/N$. If $\lambda$, the ratio between the static and dynamic power, is large so that it is not possible to maintain the aforementioned relation between the static and


Fig. 3. Dynamic energy consumption versus serial time for two cases, $s = 0.25$ and $s = 0.5$, when $N = 4$. The bound of $t$ is marked with “X” (when $f_s = F_{\max} = 1$) and “O” (when $f_p = F_{\max} = 1$). The minimum energy point in each curve (at $t = t^{*}$) is marked with a filled rectangle.

dynamic energy, we should set $f_s = 1$ and find the values of $y$ and $f_p$ that minimize the total energy consumption. Denoting these values by $y^{*}$ and $f_p^{*}$, we obtain

$$y^{*} = s + \left(\frac{\alpha - 1}{\lambda}\right)^{1/\alpha} \cdot \frac{1 - s}{N}, \qquad (22)$$

$$f_p^{*} = \left(\frac{\lambda}{\alpha - 1}\right)^{1/\alpha}. \qquad (23)$$

Again, these values result in the dynamic power consumption being $1/(\alpha - 1)$ times the static power consumption during the execution of the parallel portion of the program.

In order to summarize the relationship between $\lambda$ and speedup that results in minimum energy consumption, we show that relationship in Fig. 4. In this figure, the values of $\lambda$ are divided into three regions. When $\lambda \le (\alpha - 1)/N$, the solution for the optimum energy consumption problem is given by (18), (19), and (20). When $(\alpha - 1)/N < \lambda \le (\alpha - 1)$, the solution is given by $f_s = 1$, (22), and (23). Finally, when $\lambda > (\alpha - 1)$, the solution is given by $f_s = f_p = 1$, and the speedup is that given by Amdahl’s law (1).
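The three regions summarized above translate directly into a small selection routine. The Python sketch below is one possible rendering; the function name and return layout are assumptions, and the execution time is recovered from (3) and (4) as $y = s/f_s + (1 - s)/(N f_p)$, which reproduces (18) in the first region and (22) in the second.

```python
def energy_optimal_setting(s: float, n: int,
                           alpha: float = 3.0, lam: float = 0.1):
    """Return (f_s, f_p, y) minimizing total energy for a given lambda."""
    if lam <= (alpha - 1.0) / n:
        # Region 1: both frequencies below the maximum, Eqs. (19) and (20).
        f_s = (n * lam / (alpha - 1.0)) ** (1.0 / alpha)
        f_p = (lam / (alpha - 1.0)) ** (1.0 / alpha)
    elif lam <= alpha - 1.0:
        # Region 2: serial section at full speed, parallel speed from Eq. (23).
        f_s = 1.0
        f_p = (lam / (alpha - 1.0)) ** (1.0 / alpha)
    else:
        # Region 3: everything at full speed; speedup follows Amdahl's law.
        f_s = 1.0
        f_p = 1.0
    y = s / f_s + (1.0 - s) / (n * f_p)   # total time from Eqs. (3) and (4)
    return f_s, f_p, y


if __name__ == "__main__":
    for lam in (0.1, 1.0, 3.0):
        f_s, f_p, y = energy_optimal_setting(s=0.25, n=4, lam=lam)
        print(lam, f_s, f_p, 1.0 / y)   # last column is the resulting speedup
```

Sweeping `lam` over a finer range and plotting `1.0 / y` would give a curve of the kind the text attributes to Fig. 4.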

3.3 Optimal Energy Consumption Given a Speedup

We have thus far considered the problem of calculating the optimal speeds of processors (hence, program speedup) to minimize the total energy consumption given $p$, $\lambda$, and $N$. In this section, we consider the problem of how to set the speeds of the processors ($f_s$ and $f_p$) to minimize the total energy consumption when the target program speedup $x$ (or, equivalently, $y$) is specified. Because the static energy $N \cdot \lambda \cdot y$ is immediately determined given $x$, we only need to minimize the dynamic energy while meeting the program speedup requirement and our solution derived from (3), (4), (18), (19), and (20) is as follows:

$$\text{If } x \le \frac{1}{s + \frac{1 - s}{N^{(\alpha - 1)/\alpha}}}: \quad f_s^{*} = x \cdot f^{*}_{s,x=1}, \quad f_p^{*} = x \cdot f^{*}_{p,x=1}; \qquad (24)$$

$$\text{If } \frac{1}{s + \frac{1 - s}{N^{(\alpha - 1)/\alpha}}} < x \le \frac{1}{s + \frac{1 - s}{N}}: \quad f_s^{*} = 1, \quad f_p^{*} = \frac{(1 - s)x}{N(1 - sx)}, \qquad (25)$$

where $f^{*}_{s,x=1}$ and $f^{*}_{p,x=1}$ are the optimal frequencies when $x = 1$, as given in (11) and (12), respectively. We call the interval in (24) the linear frequency scaling interval because the energy-optimal $f_s$ and $f_p$ can be obtained by simply scaling $f^{*}_{s,x=1}$ and $f^{*}_{p,x=1}$ by a factor of $x$. We also note that the upper bound of the condition in (24) is, in fact, equivalent to $\lambda \le (\alpha - 1)/N$.
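The two cases in (24) and (25) can be sketched as a single Python function that maps a target speedup to the pair of frequencies. The function name and the error handling for unreachable speedups are assumptions for illustration; the bounds and formulas follow (6), (11), (12), (24), and (25).

```python
def frequencies_for_speedup(x: float, s: float, n: int, alpha: float = 3.0):
    """Return (f_s, f_p) minimizing dynamic energy at target speedup x."""
    k = s + (1.0 - s) / n ** ((alpha - 1.0) / alpha)
    x_linear = 1.0 / k                      # upper bound of the interval in (24)
    x_max = 1.0 / (s + (1.0 - s) / n)       # Amdahl bound, Eq. (6)
    if x > x_max:
        raise ValueError("target speedup exceeds the Amdahl bound in Eq. (6)")
    if x <= x_linear:
        # Linear frequency scaling interval: scale the x = 1 optimum by x.
        f_s = x * k                          # x times Eq. (11)
        f_p = x * k / n ** (1.0 / alpha)     # x times Eq. (12)
    else:
        # Serial region pinned at full speed; parallel speed from Eq. (25).
        f_s = 1.0
        f_p = (1.0 - s) * x / (n * (1.0 - s * x))
    return f_s, f_p


if __name__ == "__main__":
    for x in (1.0, 1.5, 2.0):
        f_s, f_p = frequencies_for_speedup(x, s=0.25, n=4)
        # Resulting execution time should equal 1 / x.
        print(x, f_s, f_p, 0.25 / f_s + 0.75 / (4 * f_p))
```

In both branches the resulting execution time $s/f_s + (1 - s)/(N f_p)$ equals $1/x$, so the returned frequencies meet the speedup target exactly.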

Fig. 5 shows how the minimum energy consumption

changes as we target a different program speedup, along

with the contribution of the dynamic and static energy

consumption. It is noticeable from the plot that the

dynamic energy of the sequential region saturates at

around $x = 2.3$. This is due to the inability to scale $f_s$ beyond $F_{\max}$. Finally, when $f_s = f_p = 1$ (i.e., at the maximum speedup), the dynamic energy is 1; it is the same as that of sequential execution.

We point out that static power consumption plays an important role in determining the minimum energy consumption of an application. Fig. 6 depicts the improvement ratio of the minimum energy at different program speedups, relative to the baseline sequential execution of a given application. The plot clearly demonstrates that a smaller $\lambda$ leads to a larger energy improvement ratio at any selected program speedup. Moreover, the largest energy improvement ratio occurs at a smaller program speedup. In other words, one can slow down processor speeds further to benefit from reducing dynamic energy to a greater degree before static energy starts to offset the benefit, if $\lambda$ is small.


Fig. 4. $\lambda$ changes the speedup of a program when its energy consumption is minimized. We assume that $\alpha = 3$.

Fig. 5. Optimal energy at the program speedup of $x$ when $\alpha = 3$. The thick dotted line shows the sequential machine’s energy consumption ($1 + \lambda$).

Fig. 6. Energy improvement at different speedups over sequential

execution.
