
Performance-effective and low-complexity task scheduling for heterogeneous computing

TL;DR: Two novel scheduling algorithms for a bounded number of heterogeneous processors with an objective to simultaneously meet high performance and fast scheduling time are presented, called the Heterogeneous Earliest-Finish-Time (HEFT) algorithm and the Critical-Path-on-a-Processor (CPOP) algorithm.
Abstract: Efficient application scheduling is critical for achieving high performance in heterogeneous computing environments. The application scheduling problem has been shown to be NP-complete in general cases as well as in several restricted cases. Because of its key importance, this problem has been extensively studied and various algorithms have been proposed in the literature which are mainly for systems with homogeneous processors. Although there are a few algorithms in the literature for heterogeneous processors, they usually require significantly high scheduling costs and they may not deliver good quality schedules with lower costs. In this paper, we present two novel scheduling algorithms for a bounded number of heterogeneous processors with an objective to simultaneously meet high performance and fast scheduling time, which are called the Heterogeneous Earliest-Finish-Time (HEFT) algorithm and the Critical-Path-on-a-Processor (CPOP) algorithm. The HEFT algorithm selects the task with the highest upward rank value at each step and assigns the selected task to the processor, which minimizes its earliest finish time with an insertion-based approach. On the other hand, the CPOP algorithm uses the summation of upward and downward rank values for prioritizing tasks. Another difference is in the processor selection phase, which schedules the critical tasks onto the processor that minimizes the total execution time of the critical tasks. In order to provide a robust and unbiased comparison with the related work, a parametric graph generator was designed to generate weighted directed acyclic graphs with various characteristics. The comparison study, based on both randomly generated graphs and the graphs of some real applications, shows that our scheduling algorithms significantly surpass previous approaches in terms of both quality and cost of schedules, which are mainly presented with schedule length ratio, speedup, frequency of best results, and average scheduling time metrics.

Summary (3 min read)

1 INTRODUCTION

  • In the next section, the authors define the research problem and the related terminology.
  • Section 4 introduces their scheduling algorithms (the HEFT and the CPOP Algorithms).
  • Section 5 presents a comparison study of their algorithms with the related work, which is based on randomly generated task graphs and task graphs of several real applications.

3.1 Task-Scheduling Heuristics for Heterogeneous Environments

  • The first phase groups the tasks that can be executed in parallel using the level attribute.
  • The second phase assigns each task to the fastest available processor.
  • Within the same level, the task with the highest computation cost has the highest priority.
  • Each task is assigned to a processor that minimizes the sum of the task's computation cost and the total communication costs with tasks in the previous levels.

4.1 Graph Attributes Used by HEFT and CPOP Algorithms

  • The downward ranks are computed recursively by traversing the task graph downward starting from the entry task of the graph.
  • For the entry task n entry , the downward rank value is equal to zero.

5 EXPERIMENTAL RESULTS AND DISCUSSION

  • The authors present the comparative evaluation of their algorithms and the related work given in Section 3.1.
  • For this purpose, the authors consider two sets of graphs as the workload for testing the algorithms: randomly generated application graphs and graphs that represent some numerical real-world problems.
  • First, the authors present the metrics used for performance evaluation, which is followed by two sections on experimental results.

5.1 Comparison Metrics

  • The SLR of a graph (using any algorithm) cannot be less than one since the denominator is the lower bound.
  • The taskscheduling algorithm that gives the lowest SLR of a graph is the best algorithm with respect to performance.
  • Average SLR values over several task graphs are used in their experiments.
  • The speedup value for a given graph is computed by dividing the sequential execution time (i.e., the cumulative computation costs of the tasks in the graph) by the parallel execution time (i.e., the makespan of the output schedule).
  • The sequential execution time is computed by assigning all tasks to the single processor that minimizes the cumulative computation costs (see the sketch after this list).
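As a concrete illustration of these two metrics, the following is a minimal sketch, assuming w[i][j] is the computation cost of task i on processor j (the paper's W matrix) and cp_min_tasks is the set of tasks on the critical path computed with minimum costs; all names are illustrative, not from the source.

    # Sketch of the SLR and speedup metrics; w[i][j] is the computation cost
    # of task i on processor j, cp_min_tasks the minimum-cost critical path.
    def schedule_length_ratio(makespan, cp_min_tasks, w):
        # Denominator: each critical-path task charged at its cheapest
        # processor, ignoring communication -- a lower bound on the makespan.
        lower_bound = sum(min(w[i]) for i in cp_min_tasks)
        return makespan / lower_bound  # >= 1 for any valid schedule

    def speedup(makespan, w):
        # Sequential time: all tasks on the single processor that minimizes
        # the cumulative computation cost.
        num_procs = len(w[0])
        sequential = min(sum(row[j] for row in w) for j in range(num_procs))
        return sequential / makespan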

Number of Occurrences of Better Quality of Schedules

  • The number of times that each algorithm produced better, worse, and equal quality of schedules compared to every other algorithm is counted in the experiments (a counting sketch follows this list).
  • The running time (or the scheduling time) of an algorithm is its execution time for obtaining the output schedule of a given task graph.
  • This metric basically gives the average cost of each algorithm.
  • Among the algorithms that give comparable SLR values, the one with the minimum running time is the most practical implementation.
  • The minimization of SLR by checking all possible task-processor pairs can conflict with the minimization of the running time.
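The counting described above can be sketched as follows, assuming a hypothetical table makespans[alg][g] of schedule lengths per algorithm and per graph (names are illustrative, not from the source).

    # Sketch of the pairwise quality count; makespans[alg][g] is a
    # hypothetical table of schedule lengths per algorithm and graph.
    from collections import defaultdict

    def count_outcomes(makespans, graphs):
        counts = defaultdict(lambda: {"better": 0, "equal": 0, "worse": 0})
        algs = list(makespans)
        for g in graphs:
            for a in algs:
                for b in algs:
                    if a == b:
                        continue
                    if makespans[a][g] < makespans[b][g]:
                        counts[(a, b)]["better"] += 1
                    elif makespans[a][g] == makespans[b][g]:
                        counts[(a, b)]["equal"] += 1
                    else:
                        counts[(a, b)]["worse"] += 1
        return counts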

5.2 Randomly Generated Application Graphs

  • In their study, the authors first considered the randomly generated application graphs.
  • A random graph generator was implemented to generate weighted application DAGs with various characteristics that depend on several input parameters given below.
  • The authors' simulation-based framework allows assigning sets of values to the parameters used by the random graph generator.
  • This framework first executes the random graph generator program to construct the application DAGs, which is followed by the execution of the scheduling algorithms to generate output schedules, and, finally, it computes the performance metrics based on the schedules.

5.2.1 Random Graph Generator

  • These combinations give 2,250 different DAG types.
  • Since 25 random DAGs were generated for each DAG type, the total number of DAGs used in their experiments was 56,250.
  • Assigning several input parameters and selecting each parameter from a large set causes the generation of diverse DAGs with various characteristics; a sketch of this parameter sweep follows the list.
  • Experiments based on diverse DAGs prevent biasing toward a particular scheduling algorithm.
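The sweep over DAG types can be sketched as below. The parameter names and value sets are placeholders, not the paper's actual sets (which yield the 2,250 DAG types mentioned above); only the structure of the sweep is shown.

    # Sketch of the DAG-type sweep; the value sets are placeholders, not the
    # paper's actual sets (which produce 2,250 parameter combinations).
    from itertools import product

    param_sets = {
        "num_tasks": [20, 40, 60, 80, 100],   # placeholder values
        "ccr": [0.1, 0.5, 1.0, 5.0, 10.0],    # communication-to-computation
        "shape": [0.5, 1.0, 2.0],             # placeholder shape parameter
    }
    DAGS_PER_TYPE = 25

    dag_types = list(product(*param_sets.values()))
    total_dags = len(dag_types) * DAGS_PER_TYPE  # types x 25 random DAGs
    for combo in dag_types:
        params = dict(zip(param_sets, combo))
        for seed in range(DAGS_PER_TYPE):
            # generate_random_dag is a hypothetical generator hook:
            # dag = generate_random_dag(seed=seed, **params)
            pass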

5.2.2 Performance Results

  • Finally, the number of times that each scheduling algorithm in the experiments produced better, worse, or equal schedule length compared to every other algorithm was counted for the 56,250 DAGs used.
  • Each cell in Table 2 indicates the comparison results of the algorithm on the left with the algorithm on the top.
  • The "combined" column shows the percentage of graphs in which the algorithm on the left gives a better, equal, or worse performance than all other algorithms combined.
  • The ranking of the algorithms, based on occurrences of best results, is {HEFT, DLS, CPOP, MH, LMT}.
  • The ranking with respect to average SLR values was: {HEFT, CPOP, DLS, MH, LMT}.

5.3.1 Gaussian Elimination

  • For the efficiency comparison, the number of processors used in their experiments is varied from 2 to 16 in powers of 2; the CCR and range percentage parameters have the same set of values.
  • Fig. 9b gives efficiency comparison for Gaussian elimination graphs when the matrix size is 50.
  • The HEFT and DLS algorithms have better efficiency than the other algorithms.
  • Since the matrix size is fixed, an increase in the number of processors decreases the makespan for each algorithm.
  • As an example, when the matrix size is 50 for 16 processors, the DLS algorithm takes 16.2 times longer than the HEFT algorithm to schedule a given graph.

5.3.3 Molecular Dynamics Code

  • This application is part of their performance evaluation since it has an irregular task graph.
  • Since the number of tasks is fixed in the application and the structure of the graph is known, only the values of CCR and range percentage parameters (in Section 5.2) are used in their experiments.
  • Fig. 14a shows the performance of the algorithms with respect to five different CCR values when the number of processors is equal to six.
  • It was also observed that the DLS and LMT algorithms take a running time almost three times longer than the other three algorithms (HEFT, CPOP, and MH).

6 ALTERNATE POLICIES FOR THE PHASES OF THE HEFT ALGORITHM

  • The original HEFT algorithm outperforms these alternates for small CCR graphs.
  • For high CCR graphs, some benefit has been observed by taking critical child tasks into account during processor selection.
  • For moderate CCR values, the B1 policy slightly outperforms the original HEFT algorithm.
  • For the highest CCR values, the B2 policy outperforms the original algorithm and the other alternates by 4 percent.

7 CONCLUSIONS

  • The authors plan to extend the HEFT Algorithm for rescheduling tasks in response to changes in processor and network loads.
  • This extension may provide some bounds on the degradation of the makespan when the number of available processors is not sufficient.
  • It is also planned to extend these algorithms to arbitrarily connected networks by considering link contention.



Performance-Effective and Low-Complexity
Task Scheduling for Heterogeneous Computing
Haluk Topcuoglu, Member, IEEE, Salim Hariri, Member, IEEE Computer Society, and Min-You Wu, Senior Member, IEEE
Abstract: Efficient application scheduling is critical for achieving high performance in heterogeneous computing environments. The
application scheduling problem has been shown to be NP-complete in general cases as well as in several restricted cases. Because of
its key importance, this problem has been extensively studied and various algorithms have been proposed in the literature which are
mainly for systems with homogeneous processors. Although there are a few algorithms in the literature for heterogeneous processors,
they usually require significantly high scheduling costs and they may not deliver good quality schedules with lower costs. In this paper,
we present two novel scheduling algorithms for a bounded number of heterogeneous processors with an objective to simultaneously
meet high performance and fast scheduling time, which are called the Heterogeneous Earliest-Finish-Time (HEFT) algorithm
and the Critical-Path-on-a-Processor (CPOP) algorithm. The HEFT algorithm selects the task with the highest upward rank
value at each step and assigns the selected task to the processor, which minimizes its earliest finish time with an insertion-based
approach. On the other hand, the CPOP algorithm uses the summation of upward and downward rank values for prioritizing
tasks. Another difference is in the processor selection phase, which schedules the critical tasks onto the processor that minimizes
the total execution time of the critical tasks. In order to provide a robust and unbiased comparison with the related work, a
parametric graph generator was designed to generate weighted directed acyclic graphs with various characteristics. The
comparison study, based on both randomly generated graphs and the graphs of some real applications, shows that our
scheduling algorithms significantly surpass previous approaches in terms of both quality and cost of schedules, which are mainly
presented with schedule length ratio, speedup, frequency of best results, and average scheduling time metrics.
Index Terms: DAG scheduling, task graphs, heterogeneous systems, list scheduling, mapping.
1 INTRODUCTION
Diverse sets of resources interconnected with a high-speed network provide a new computing platform,
called the heterogeneous computing system, which can
support executing computationally intensive parallel and
distributed applications. A heterogeneous computing
system requires compile-time and runtime support for
executing applications. The efficient scheduling of the
tasks of an application on the available resources is one of
the key factors for achieving high performance.
The general task scheduling problem includes the
problem of assigning the tasks of an application to suitable
processors and the problem of ordering task executions on
each resource. When the characteristics of an application, including the execution times of tasks, the sizes of data communicated between tasks, and the task dependencies, are known a priori, the application is represented with a static model.
In the general form of a static task scheduling problem, an application is represented by a directed acyclic graph (DAG) in which nodes represent application tasks and
label shows computation cost (expected computation time)
of the task and each edge label shows intertask commu-
nication cost (expected communication time) between
tasks. The objective function of this problem is to map
tasks onto processors and order their executions so that
task-precedence requirements are satisfied and a mini-
mum overall completion time is obtained. The task
scheduling problem is NP-complete in the general case
[1], as well as some restricted cases [2], such as scheduling
tasks with one or two time units to two processors and
scheduling unit-time tasks to an arbitrary number of
processors.
Because of its key importance on performance, the task
scheduling problem in general has been extensively studied
and various heuristics were proposed in the literature [3],
[4], [5], [6], [7], [8], [9], [10], [11], [13], [12], [16], [17], [18],
[20], [22], [23], [27], [30]. These heuristics are classified into a
variety of categories (such as list-scheduling algorithms,
clustering algorithms, duplication-based algorithm, guided
random search methods) and they are mainly for systems
with homogeneous processors.
In a list scheduling algorithm [3], [4], [6], [7], [18], [22], an
ordered list of tasks is constructed by assigning a priority to
each task. Tasks are selected in the order of their priorities
and each selected task is scheduled to a processor which
minimizes a predefined cost function. The algorithms in
this category provide good quality of schedules and their
performance is comparable with the other categories at a lower scheduling time [21], [26].

H. Topcuoglu is with the Computer Engineering Department, Marmara University, Goztepe Kampusu, 81040, Istanbul, Turkey. E-mail: haluk@eng.marmara.edu.tr.
S. Hariri is with the Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85721-0104. E-mail: hariri@ece.arizona.edu.
M.-Y. Wu is with the Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131-1356. E-mail: wu@eece.unm.edu.
Manuscript received 28 Aug. 2000; revised 12 July 2001; accepted 6 Sept. 2001. For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number 112783.

The clustering algorithms [3], [12], [19], [25] are, in general, for an unbounded number of processors, so they may not be directly applicable. A
clustering algorithm requires a second phase (a scheduling
module) to merge the task clusters generated by the
algorithm onto a bounded number of processors and to
order the task executions within each processor [24].
Similarly, task duplication-based heuristics are not practical
because of their significantly high time complexity. As an
example, the time complexity of the BTDH Algorithm [30]
and the DSH Algorithm [18] are O(v^4); the complexity of the CPFD Algorithm [9] is O(e · v^2) for scheduling v tasks connected with e edges on a set of homogeneous processors.
Genetic Algorithms [5], [8], [11], [13], [17], [31] (GAs) are
of the most widely studied guided random search techni-
ques for the task scheduling problem. Although they
provide good quality of schedules, their execution times
are significantly higher than the other alternatives. It was
shown that the improvement of the GA-based solution to
the second best solution was not more than 10 percent and
the GA-based approach required around a minute to
produce a solution, while the other heuristics required an
execution of a few seconds [31]. Additionally, extensive
tests are required to find optimal values for the set of
control parameters used in GA-based solutions.
The task scheduling problem has also been studied by a
few research groups for the heterogeneous systems [6], [7],
[8], [10], [11], [13], [14]. These algorithms may require
assigning a set of control parameters, and some of them incur substantially high scheduling costs [6],
[8], [11], [13]. Some of them partition the tasks in a DAG into
levels such that there will be no dependency between tasks
in the same level [10], [14]. This level-by-level scheduling
technique considers the tasks only in the current level (that
is, a subset of ready tasks) at any time, which may not
perform well because it does not consider all ready tasks.
Additionally, the study given in [14] presents a dynamic
remapper that requires an initial schedule of a given DAG
and then improves its performance using three variants of
an algorithm, which is out of the scope of this paper.
In this paper, we propose two new static scheduling
algorithms for a bounded number of fully connected
heterogeneous processors: the Heterogeneous Earliest-
Finish-Time (HEFT) algorithm and the Critical-Path-on-a-
Processor (CPOP) algorithm. Although the static-schedul-
ing for heterogeneous systems is offline, in order to
provide a practical solution, the scheduling time (or
running time) of an algorithm is the key constraint.
Therefore, the motivation behind these algorithms is to
deliver good-quality schedules (or outputs with better schedule lengths) at lower costs (i.e., lower scheduling times). The HEFT Algorithm selects the task with the
highest upward rank (defined in Section 4.1) at each step.
The selected task is then assigned to the processor which
minimizes its earliest finish time with an insertion-based
approach. The upward rank of a task is the length of the
critical path (i.e., the longest path) from the task to an exit
task, including the computation cost of the task. The
CPOP algorithm selects the task with the highest (upward
rank + downward rank) value at each step. The algorithm
targets scheduling of all critical tasks (i.e., tasks on the
critical path of the DAG) onto a single processor, which
minimizes the total execution time of the critical tasks. If
the selected task is noncritical, the processor selection
phase is based on earliest execution time with insertion-
based scheduling, as in the HEFT Algorithm.
As part of this research work, a parametric graph
generator has been designed to generate weighted directed
acyclic graphs for the performance study of the scheduling
algorithms. The graph generator targets the generation of
many types of DAGs using several input parameters that
provide an unbiased comparison of task-scheduling algo-
rithms. The comparison study in this paper is based on both
randomly generated task graphs and the task graphs of real
applications, including the Gaussian Elimination Algorithm
[3], [28], FFT Algorithm [29], [30], and a molecular dynamic
code given in [19]. The comparison study shows that our
algorithms significantly surpass previous approaches in
terms of both performance metrics (schedule length ratio,
speedup, efficiency, and number of occurrences giving best
results) and a cost metric (scheduling time to deliver an
output schedule).
The remainder of this paper is organized as follows: In
the next section, we define the research problem and the
related terminology. In Section 3, we provide a taxonomy of
task-scheduling algorithms and the related work in
scheduling for heterogeneous systems. Section 4 introduces
our scheduling algorithms (the HEFT and the CPOP
Algorithms). Section 5 presents a comparison study of our
algorithms with the related work, which is based on
randomly generated task graphs and task graphs of
several real applications. In Section 6, we introduce
several extensions to the HEFT algorithm. The summary
of the research presented and planned future work is
given in Section 7.
2 TASK-SCHEDULING PROBLEM
A scheduling system model consists of an application, a target computing environment, and a performance criterion for scheduling. An application is represented by a directed acyclic graph, G = (V, E), where V is the set of v tasks and E is the set of e edges between the tasks. (The terms task and node are used interchangeably in this paper.) Each edge (i, j) ∈ E represents the precedence constraint that task n_i must complete its execution before task n_j starts. Data is a v × v matrix of communication data, where data_{i,k} is the amount of data required to be transmitted from task n_i to task n_k.
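A minimal sketch of this model as a data structure, with illustrative names only, might look like:

    # Sketch of the task-graph model: v tasks, edge set E, and the v x v
    # communication-data matrix; all names are illustrative.
    class TaskGraph:
        def __init__(self, num_tasks):
            self.num_tasks = num_tasks                 # |V| = v
            self.edges = set()                         # E, pairs (i, k)
            v = num_tasks
            self.data = [[0.0] * v for _ in range(v)]  # data[i][k]

        def add_edge(self, i, k, amount):
            # Edge (i, k): task n_i must finish before n_k starts and
            # sends data[i][k] units of data to it.
            self.edges.add((i, k))
            self.data[i][k] = amount

        def successors(self, i):
            return [k for (a, k) in self.edges if a == i]

        def predecessors(self, k):
            return [a for (a, b) in self.edges if b == k]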
In a given task graph, a task without any parent is
called an entry task and a task without any child is called
an exit task. Some of the task scheduling algorithms may
require single-entry and single-exit task graphs. If there is
more than one exit (entry) task, they are connected to a
zero-cost pseudo exit (entry) task with zero-cost edges, which
does not affect the schedule.
We assume that the target computing environment
consists of a set Q of q heterogeneous processors
connected in a fully connected topology in which all
interprocessor communications are assumed to perform
without contention. In our model, it is also assumed that

computation can be overlapped with communication. Additionally, task executions of a given application are assumed to be nonpreemptive. W is a v × q computation cost matrix in which each w_{i,j} gives the estimated execution time to complete task n_i on processor p_j. Before scheduling, the tasks are labeled with their average execution costs. The average execution cost of a task n_i is defined as

\bar{w}_i = \sum_{j=1}^{q} w_{i,j} / q.   (1)

The data transfer rates between processors are stored in matrix B of size q × q. The communication startup costs of processors are given in a q-dimensional vector L. The communication cost of edge (i, k), which is for transferring data from task n_i (scheduled on p_m) to task n_k (scheduled on p_n), is defined by

c_{i,k} = L_m + data_{i,k} / B_{m,n}.   (2)

When both n_i and n_k are scheduled on the same processor, c_{i,k} becomes zero, since we assume that the intraprocessor communication cost is negligible when compared with the interprocessor communication cost. Before scheduling, average communication costs are used to label the edges. The average communication cost of an edge (i, k) is defined by

\bar{c}_{i,k} = \bar{L} + data_{i,k} / \bar{B},   (3)

where \bar{B} is the average transfer rate among the processors in the domain and \bar{L} is the average communication startup time.
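Equations (1)-(3) map directly to code. A sketch under the definitions above (w is v × q, B is q × q, L has length q, and data is the v × v matrix); the function names are illustrative:

    # Sketch of the cost labels in (1)-(3).
    def avg_computation_cost(w, i):
        return sum(w[i]) / len(w[i])               # equation (1)

    def communication_cost(L, B, data, i, k, m, n):
        # Cost of edge (i, k) with n_i on p_m and n_k on p_n; zero when
        # both tasks share a processor (intraprocessor cost is negligible).
        if m == n:
            return 0.0
        return L[m] + data[i][k] / B[m][n]         # equation (2)

    def avg_communication_cost(L_bar, B_bar, data, i, k):
        return L_bar + data[i][k] / B_bar          # equation (3)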
Before presenting the objective function, it is necessary to define the EST and EFT attributes, which are derived from a given partial schedule. EST(n_i, p_j) and EFT(n_i, p_j) are the earliest execution start time and the earliest execution finish time of task n_i on processor p_j, respectively. For the entry task n_{entry},

EST(n_{entry}, p_j) = 0.   (4)

For the other tasks in the graph, the EFT and EST values are computed recursively, starting from the entry task, as shown in (5) and (6), respectively. In order to compute the EFT of a task n_i, all immediate predecessor tasks of n_i must have been scheduled:

EST(n_i, p_j) = max{ avail[j], max_{n_m ∈ pred(n_i)} ( AFT(n_m) + c_{m,i} ) },   (5)

EFT(n_i, p_j) = w_{i,j} + EST(n_i, p_j),   (6)

where pred(n_i) is the set of immediate predecessor tasks of task n_i and avail[j] is the earliest time at which processor p_j is ready for task execution. If n_k is the last assigned task on processor p_j, then avail[j] is the time at which processor p_j completed the execution of task n_k and is ready to execute another task, under a noninsertion-based scheduling policy. The inner max block in the EST equation returns the ready time, i.e., the time when all data needed by n_i has arrived at processor p_j.
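A sketch of the recursion in (4)-(6), plus the makespan defined in (7) below; pred, aft, c, and avail are assumed to be supplied by the surrounding scheduler:

    # Sketch of equations (4)-(7); pred(i) lists immediate predecessors,
    # aft[m] is the actual finish time of scheduled task m, c(m, i) the
    # realized communication cost, avail[j] the processor-ready time.
    def est(i, j, pred, aft, c, avail):
        preds = pred(i)
        if not preds:                                    # entry task, (4)
            return 0.0
        ready = max(aft[m] + c(m, i) for m in preds)     # data-ready time
        return max(avail[j], ready)                      # equation (5)

    def eft(i, j, w, pred, aft, c, avail):
        return w[i][j] + est(i, j, pred, aft, c, avail)  # equation (6)

    def makespan(aft, exit_tasks):
        return max(aft[i] for i in exit_tasks)           # equation (7)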
After a task n_m is scheduled on a processor p_j, the earliest start time and the earliest finish time of n_m on processor p_j are equal to the actual start time, AST(n_m), and the actual finish time, AFT(n_m), of task n_m, respectively. After all tasks in a graph are scheduled, the schedule length (i.e., the overall completion time) is the actual finish time of the exit task n_{exit}. If there are multiple exit tasks and the convention of inserting a pseudo exit task is not applied, the schedule length (also called the makespan) is defined as

makespan = max{ AFT(n_{exit}) }.   (7)

The objective function of the task-scheduling problem is to determine the assignment of the tasks of a given application to processors such that the schedule length is minimized.
3 RELATED WORK
Static task-scheduling algorithms can be classified into two
main groups (see Fig. 1), heuristic-based and guided
random-search-based algorithms. The former can be further
classified into three groups: list scheduling heuristics,
clustering heuristics, and task duplication heuristics.
List Scheduling Heuristics. A list-scheduling heuristic
maintains a list of all tasks of a given graph according to
their priorities. It has two phases: the task prioritizing (or task
selection) phase for selecting the highest-priority ready task
and the processor selection phase for selecting a suitable
processor that minimizes a predefined cost function (which
can be the execution start time). Some of the examples are
the Modified Critical Path (MCP) [3], Dynamic Level
Scheduling [6], Mapping Heuristic (MH) [7], Insertion-
Scheduling Heuristic [18], Earliest Time First (ETF) [22],
and Dynamic Critical Path (DCP) [4] algorithms. Most of
the list-scheduling algorithms are for a bounded number
of fully connected homogeneous processors. List-schedul-
ing heuristics are generally more practical and provide
better performance results at a lower scheduling time than
the other groups.
Clustering Heuristics. An algorithm in this group maps
the tasks in a given graph to an unlimited number of
clusters. At each step, the selected tasks for clustering can
be any task, not necessarily a ready task. Each iteration
refines the previous clustering by merging some clusters. If
two tasks are assigned to the same cluster, they will be
executed on the same processor. A clustering heuristic
requires additional steps to generate a final schedule: a
cluster merging step for merging the clusters so that the
remaining number of clusters equal the number of
processors, a cluster mapping step for mapping the clusters
on the available processors, and a task ordering step for
ordering the mapped tasks within each processor [24].
Some examples in this group are the Dominant Sequence
Clustering (DSC) [12], Linear Clustering Method [19],
Mobility Directed [3], and Clustering and Scheduling
System (CASS) [25].
Task Duplication Heuristics. The idea behind duplica-
tion-based scheduling algorithms is to schedule a task
graph by mapping some of its tasks redundantly, which
reduces the interprocess communication overhead [9], [18],
[27], [30]. Duplication-based algorithms differ according to
the selection strategy of the tasks for duplication. The
262 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 13, NO. 3, MARCH 2002

algorithms in this group are usually for an unbounded
number of identical processors and they have much higher
complexity values than the algorithms in the other groups.
Guided Random Search Techniques. Guided random
search techniques (or randomized search techniques) use
random choice to guide themselves through the problem
space, which is not the same as performing merely random
walks as in the random search methods. These techniques
combine the knowledge gained from previous search
results with some randomizing features to generate new
results. Genetic algorithms (GAs) [5], [8], [11], [13], [17] are
the most popular and widely used techniques for several
flavors of the task scheduling problem. GAs generate good
quality of output schedules; however, their scheduling
times are usually much higher than the heuristic-based
techniques [31]. Additionally, several control parameters in
a genetic algorithm should be determined appropriately.
The optimal set of control parameters used for scheduling
a task graph may not give the best results for another
task graph. In addition to GAs, simulated annealing [11],
[15] and local search [16], [20] are the other
methods in this group.
3.1 Task-Scheduling Heuristics for Heterogeneous Environments
This section presents the reported task-scheduling heuristics
that support heterogeneous processors, which are the
Dynamic Level Scheduling Algorithm [6], the Levelized-
Min Time Algorithm [10], and the Mapping Heuristic
Algorithm [7].
Dynamic-Level Scheduling (DLS) Algorithm. At each step, the algorithm selects the (ready node, available processor) pair that maximizes the value of the dynamic level, which is equal to DL(n_i, p_j) = rank_u^s(n_i) - EST(n_i, p_j). The computation cost of a task is the median value of the computation costs of the task on the processors. In this algorithm, the upward rank calculation does not consider the communication costs. For heterogeneous environments, a new term is added for the difference between the task's median execution time on all processors and its execution time on the current processor. The general DLS algorithm has an O(v^3 · q) time complexity, where v is the number of tasks and q is the number of processors.
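A sketch of one DLS selection step as described above; ready_nodes, rank_s_u, and est are assumed to come from the surrounding scheduler:

    # Sketch of one DLS step: pick the (ready node, processor) pair
    # maximizing DL(n_i, p_j) = rank_u^s(n_i) - EST(n_i, p_j).
    def dls_select(ready_nodes, processors, rank_s_u, est):
        best = None
        for i in ready_nodes:
            for j in processors:
                dl = rank_s_u(i) - est(i, j)
                if best is None or dl > best[0]:
                    best = (dl, i, j)
        _, task, proc = best
        return task, proc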
Mapping Heuristic (MH). In this algorithm, the computation cost of a task on a processor is computed as the number of instructions to be executed in the task divided by the speed of the processor. However, in setting the computation costs of tasks and the communication costs of edges before scheduling, similar processing elements (i.e., homogeneous processors) are assumed; the heterogeneity comes into the picture during the scheduling process.
This algorithm uses static upward ranks to assign priorities. (The authors also experimented with adding the communication delay to the rank values.) In this algorithm, the ready time of a processor for a task is the time when the processor has finished its last assigned task and is ready to execute a new one. The MH algorithm does not schedule a task in an idle time slot between two already-scheduled tasks. The time complexity, when contention is considered, is equal to O(v^2 · q^3) for v tasks and q processors; otherwise, it is equal to O(v^2 · q).
Levelized-Min Time (LMT) Algorithm. This is a two-phase algorithm. The first phase groups the tasks that can be executed in parallel using the level attribute. The second phase assigns each task to the fastest available processor. A task in a lower level has higher priority than a task in a higher level. Within the same level, the task with the highest computation cost has the highest priority. Each task is assigned to a processor that minimizes the sum of the task's computation cost and the total communication costs with tasks in the previous levels. For a fully connected graph, the time complexity is O(v^2 · q^2) for v tasks and q processors.
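The level attribute used by the first phase can be computed as the length of the longest predecessor chain; a minimal sketch, assuming tasks are indexed in topological order:

    # Sketch of LMT's first phase: group tasks into precedence levels so
    # that tasks within one level are independent of each other.
    def levelize(num_tasks, predecessors):
        level = [0] * num_tasks
        for i in range(num_tasks):        # assumes topological indexing
            for p in predecessors(i):
                level[i] = max(level[i], level[p] + 1)
        groups = {}
        for i, lv in enumerate(level):
            groups.setdefault(lv, []).append(i)
        return groups   # level -> tasks executable in parallel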
Fig. 1. Classification of static task-scheduling algorithms.

4 TASK-SCHEDULING ALGORITHMS
Before introducing the details of the HEFT and CPOP algorithms, we define the graph attributes used for setting the task priorities.
4.1 Graph Attributes Used by HEFT and CPOP Algorithms
Tasks are ordered in our algorithms by their scheduling priorities, which are based on upward and downward ranking. The upward rank of a task n_i is recursively defined by

rank_u(n_i) = \bar{w}_i + max_{n_j ∈ succ(n_i)} ( \bar{c}_{i,j} + rank_u(n_j) ),   (8)

where succ(n_i) is the set of immediate successors of task n_i, \bar{c}_{i,j} is the average communication cost of edge (i, j), and \bar{w}_i is the average computation cost of task n_i. Since the rank is computed recursively by traversing the task graph upward, starting from the exit task, it is called the upward rank. For the exit task n_{exit}, the upward rank value is equal to

rank_u(n_{exit}) = \bar{w}_{exit}.   (9)

Basically, rank_u(n_i) is the length of the critical path from task n_i to the exit task, including the computation cost of task n_i. There are algorithms in the literature that compute the rank value using computation costs only; this variant is called the static upward rank, rank_u^s.
Similarly, the downward rank of a task n_i is recursively defined by

rank_d(n_i) = max_{n_j ∈ pred(n_i)} ( rank_d(n_j) + \bar{w}_j + \bar{c}_{j,i} ),   (10)

where pred(n_i) is the set of immediate predecessors of task n_i. The downward ranks are computed recursively by traversing the task graph downward, starting from the entry task of the graph. For the entry task n_{entry}, the downward rank value is equal to zero. Basically, rank_d(n_i) is the longest distance from the entry task to task n_i, excluding the computation cost of the task itself.
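Equations (8)-(10) translate into two memoized graph traversals. A sketch, where w_bar and c_bar stand for the average costs of Section 2 and succ/pred return immediate neighbors (all names are illustrative):

    # Sketch of upward rank (8)-(9) and downward rank (10).
    from functools import lru_cache

    def make_ranks(w_bar, c_bar, succ, pred):
        @lru_cache(maxsize=None)
        def rank_u(i):
            succs = tuple(succ(i))
            if not succs:                  # exit task, equation (9)
                return w_bar[i]
            return w_bar[i] + max(c_bar(i, j) + rank_u(j) for j in succs)

        @lru_cache(maxsize=None)
        def rank_d(i):
            preds = tuple(pred(i))
            if not preds:                  # entry task: rank_d = 0
                return 0.0
            return max(rank_d(j) + w_bar[j] + c_bar(j, i) for j in preds)

        return rank_u, rank_d

Sorting tasks by decreasing rank_u then yields the priority list used by the HEFT algorithm in Section 4.2.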
4.2 The Heterogeneous-Earliest-Finish-Time (HEFT) Algorithm
The HEFT algorithm (Fig. 2) is an application scheduling algorithm for a bounded number of heterogeneous processors, which has two major phases: a task prioritizing phase for computing the priorities of all tasks and a processor selection phase for selecting the tasks in the order of their priorities and scheduling each selected task on its "best" processor, the one that minimizes the task's finish time.
Task Prioritizing Phase. This phase requires the priority of each task to be set to its upward rank value, rank_u, which is based on mean computation and mean communication costs. The task list is generated by sorting the tasks in decreasing order of rank_u. Tie-breaking is done randomly. There can be alternative tie-breaking policies, such as selecting the task whose immediate successor task(s) have higher upward ranks. Since these alternate policies increase the time complexity, we prefer a random selection strategy. It can easily be shown that the decreasing order of rank_u values provides a topological order of tasks, i.e., a linear order that preserves the precedence constraints.
Processor Selection Phase. For most task scheduling algorithms, the earliest available time of a processor p_j for a task execution is the time when p_j completes the execution of its last assigned task. However, the HEFT algorithm has an insertion-based policy which considers the possible insertion of a task into the earliest idle time slot between two already-scheduled tasks on a processor. The length of an idle time slot, i.e., the difference between the execution start time and finish time of two tasks consecutively scheduled on the same processor, should be at least the computation cost of the task to be scheduled. Additionally, scheduling in this idle time slot should preserve the precedence constraints.
In the HEFT algorithm, the search for an appropriate idle time slot for a task n_i on a processor p_j starts at the ready time of n_i on p_j, i.e., the time when all input data of n_i sent by n_i's immediate predecessor tasks has arrived at processor p_j. The search continues until the first idle time slot capable of holding the computation cost of task n_i is found. The HEFT algorithm has an O(e · q) time complexity for e edges and q processors. For a dense graph, when the number of edges is proportional to v^2 (v is the number of tasks), the time complexity is on the order of O(v^2 · q).
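A sketch of the insertion-based slot search just described; busy is assumed to be the sorted list of (start, finish) intervals already scheduled on the candidate processor (names are illustrative):

    # Sketch of HEFT's insertion-based policy: earliest feasible start for a
    # task with data-ready time `ready` and cost `cost` on one processor.
    def earliest_insertion_start(busy, ready, cost):
        start = ready
        for s, f in busy:                 # busy is sorted by start time
            if s - start >= cost:         # idle slot before this task fits
                return start
            start = max(start, f)         # skip past the busy interval
        return start                      # append after the last task

The candidate EFT on a processor is then earliest_insertion_start(busy, ready, w[i][j]) + w[i][j], and the task goes to the processor with the smallest such value.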
As an illustration, Fig. 4a presents the schedule obtained by the HEFT algorithm for the sample DAG of Fig. 3. The schedule length, which is equal to 80, is shorter than the schedule lengths obtained by the related work; specifically, the schedule lengths of the DLS, MH, and LMT algorithms are 91, 91, and 95, respectively. The first column of Table 1 gives the upward rank values for the given task graph. The scheduling order of the tasks under the HEFT algorithm is {n_1, n_3, n_4, n_2, n_5, n_6, n_9, n_7, n_8, n_10}.
4.3 The Critical-Path-on-a-Processor (CPOP) Algorithm
Although our second algorithm, the CPOP algorithm shown in Fig. 5, has the same task prioritizing and processor selection phases as the HEFT algorithm, it uses the summation of the upward and downward rank values for prioritizing tasks and schedules the critical tasks onto the processor that minimizes the total execution time of those tasks.
Fig. 2. The HEFT algorithm.

Citations
Journal ArticleDOI
01 Feb 2011
TL;DR: StarPU is a runtime system that provides a high-level, unified execution model, giving numerical kernel designers a convenient way to generate parallel tasks over heterogeneous hardware and to develop and tune powerful scheduling algorithms.
Abstract: In the field of HPC, the current hardware trend is to design multiprocessor architectures featuring heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE) or data-parallel accelerators (e.g. GPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offload parts of the computations. However, designing an execution model that unifies all computing units and associated embedded memory remains a main challenge. We therefore designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run-time, and we have analyzed their efficiency on several algorithms running simultaneously over multiple cores and a GPU. In addition to substantial improvements regarding execution times, we have obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine. We eventually show that our dynamic approach competes with the highly optimized MAGMA library and overcomes the limitations of the corresponding static scheduling in a portable way. Copyright © 2010 John Wiley & Sons, Ltd.

1,116 citations

Journal ArticleDOI
TL;DR: The taxonomy provides end users with a mechanism by which they can assess the suitability of workflow in general and how they might use these features to make an informed choice about which workflow system would be a good choice for their particular application.

903 citations

Journal ArticleDOI
TL;DR: An integrated view of the Pegasus system is provided, showing its capabilities that have been developed over time in response to application needs and to the evolution of the scientific computing platforms.

701 citations


Cites methods from "Performance-effective and low-compl..."

  • ...The Mapper supports a variety of site selection strategies such as Random, Round Robin, and HEFT [59]....


Journal ArticleDOI
TL;DR: Two workflow scheduling algorithms are proposed which aim to minimize the workflow execution cost while meeting a deadline and have a polynomial time complexity which make them suitable options for scheduling large workflows in IaaS Clouds.

580 citations



Journal ArticleDOI
Hamid Arabnejad1, Jorge G. Barbosa1
TL;DR: The analysis and experiments show that the PEFT algorithm outperforms the state-of-the-art list-based algorithms for heterogeneous systems in terms of schedule length ratio, efficiency, and frequency of best results.
Abstract: Efficient application scheduling algorithms are important for obtaining high performance in heterogeneous computing systems. In this paper, we present a novel list-based scheduling algorithm called Predict Earliest Finish Time (PEFT) for heterogeneous computing systems. The algorithm has the same time complexity as the state-of-the-art algorithm for the same purpose, that is, O(v2.p) for v tasks and p processors, but offers significant makespan improvements by introducing a look-ahead feature without increasing the time complexity associated with computation of an optimistic cost table (OCT). The calculated value is an optimistic cost because processor availability is not considered in the computation. Our algorithm is only based on an OCT that is used to rank tasks and for processor selection. The analysis and experiments based on randomly generated graphs with various characteristics and graphs of real-world applications show that the PEFT algorithm outperforms the state-of-the-art list-based algorithms for heterogeneous systems in terms of schedule length ratio, efficiency, and frequency of best results.

460 citations


Cites background or result from "Performance-effective and low-compl..."

  • ...HEFT has a complexity of Oðv2:pÞ, where v is the number of tasks and p is the number of processors....


  • ...Index Terms—Application scheduling, DAG scheduling, random graphs generator, static scheduling Ç...


  • ...Clustering heuristics are mainly proposed for homogeneous systems to form clusters of tasks that are then assigned to processors....


  • ...In comparison with clustering algorithms, they have lower time complexity, and in comparison with task duplication strategies, their solutions represent a lower processor workload....


References
Book
01 Jan 1979
TL;DR: The second edition of a quarterly column as discussed by the authors provides a continuing update to the list of problems (NP-complete and harder) presented by M. R. Garey and myself in our book "Computers and Intractability: A Guide to the Theory of NP-Completeness,” W. H. Freeman & Co., San Francisco, 1979.
Abstract: This is the second edition of a quarterly column the purpose of which is to provide a continuing update to the list of problems (NP-complete and harder) presented by M. R. Garey and myself in our book ‘‘Computers and Intractability: A Guide to the Theory of NP-Completeness,’’ W. H. Freeman & Co., San Francisco, 1979 (hereinafter referred to as ‘‘[G&J]’’; previous columns will be referred to by their dates). A background equivalent to that provided by [G&J] is assumed. Readers having results they would like mentioned (NP-hardness, PSPACE-hardness, polynomial-time-solvability, etc.), or open problems they would like publicized, should send them to David S. Johnson, Room 2C355, Bell Laboratories, Murray Hill, NJ 07974, including details, or at least sketches, of any new proofs (full papers are preferred). In the case of unpublished results, please state explicitly that you would like the results mentioned in the column. Comments and corrections are also welcome. For more details on the nature of the column and the form of desired submissions, see the December 1981 issue of this journal.

40,020 citations

Book
01 Jan 1990
TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.
Abstract: From the Publisher: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures. Like the first edition,this text can also be used for self-study by technical professionals since it discusses engineering issues in algorithm design as well as the mathematical aspects. In its new edition,Introduction to Algorithms continues to provide a comprehensive introduction to the modern study of algorithms. The revision has been updated to reflect changes in the years since the book's original publication. New chapters on the role of algorithms in computing and on probabilistic analysis and randomized algorithms have been included. Sections throughout the book have been rewritten for increased clarity,and material has been added wherever a fuller explanation has seemed useful or new information warrants expanded coverage. As in the classic first edition,this new edition of Introduction to Algorithms presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers. Further,the algorithms are presented in pseudocode to make the book easily accessible to students from all programming language backgrounds. Each chapter presents an algorithm,a design technique,an application area,or a related topic. The chapters are not dependent on one another,so the instructor can organize his or her use of the book in the way that best suits the course's needs. Additionally,the new edition offers a 25% increase over the first edition in the number of problems,giving the book 155 problems and over 900 exercises thatreinforcethe concepts the students are learning.

21,651 citations

Journal ArticleDOI
TL;DR: In this article, a Logo-like language is used to draw geometric pictures, and programs are developed to draw such pictures using this language.
Abstract: The primary purpose of a programming language is to assist the programmer in the practice of her art. Each language is either designed for a class of problems or supports a different style of programming. In other words, a programming language turns the computer into a ‘virtual machine’ whose features and capabilities are unlimited. In this article, we illustrate these aspects through a language similar tologo. Programs are developed to draw geometric pictures using this language.

5,749 citations

Journal ArticleDOI
TL;DR: It is shown that the problem of finding an optimal schedule for a set of jobs is NP-complete even in two restricted cases, which is tantamount to showing that the scheduling problems mentioned are intractable.

1,356 citations