
An Efficient Fault-tolerant Scheduling Algorithm for Real-time Tasks with
Precedence Constraints in Heterogeneous Systems
Xiao Qin, Hong Jiang, David R. Swanson
Department of Computer Science and Engineering
University of Nebraska-Lincoln
Lincoln, NE 68588-0115, {xqin, jiang, dswanson}@cse.unl.edu
This work was supported by an NSF grant (EPS-0091900) and a Nebraska University Foundation grant (26-0511-0019)
Abstract
In this paper, we investigate an efficient off-line
scheduling algorithm in which real-time tasks with
precedence constraints are executed in a heterogeneous
environment. It provides more features and capabilities
than existing algorithms that schedule only independent
tasks in real-time homogeneous systems. In addition, the
proposed algorithm takes the heterogeneities of
computation, communication and reliability into account,
thereby improving the reliability. To provide fault-
tolerant capability, the algorithm employs a primary-
backup copy scheme that enables the system to tolerate
permanent failures in any single processor. In this
scheme, a backup copy is allowed to overlap with other
backup copies on the same processor, as long as their
corresponding primary copies are allocated to different
processors. Tasks are judiciously allocated to processors
so as to reduce the schedule length as well as the
reliability cost, defined to be the product of processor
failure rate and task execution time. In addition, the time
for detecting and handling a permanent fault is
incorporated into the scheduling scheme, thus making the
algorithm more practical. To quantify the combined
performance of fault-tolerance and schedulability, the
performability measure is introduced. Compared with the
existing scheduling algorithms in the literature, our
scheduling algorithm achieves an average of 16.4%
improvement in reliability and an average of 49.3%
improvement in performability.
1. Introduction
Heterogeneous distributed systems have been
increasingly used for scientific and commercial
applications, including real-time safety-critical
applications, in which the system depends not only on the
results of a computation, but also on the time instants at
which these results become available. Examples of such
applications include aircraft control, transportation
systems and medical electronics. To obtain high
performance for real-time heterogeneous systems,
scheduling algorithms play an important role. A
scheduling algorithm maps real-time tasks to processors
in the system such that deadline and response-time
requirements are met; at the same time, the system must
guarantee its functional and timing correctness even in
the presence of faults.
The proposed algorithm, referred to as eFRCD (efficient
Fault-tolerant Reliability Cost Driven Algorithm),
endeavors to comprehensively address the issues of fault-
tolerance, reliability, real-time, task precedence
constraints, and heterogeneity. To tolerate the permanent
failure of a single processor, the algorithm uses a
primary/backup technique to allocate two copies of each task to different
processors. To further improve the quality of the schedule,
a backup copy is allowed to overlap with other backup
copies on the same processor, as long as their
corresponding primary copies are allocated to different
processors. As an added measure of fault-tolerance, the
proposed algorithm also considers the heterogeneities of
computation and reliability, thereby improving the
reliability without extra hardware cost. More precisely,
tasks are judiciously allocated to processors so as to
reduce the schedule length as well as the reliability cost,
defined to be the product of processor failure rate and task
execution time. In addition, the time for detecting and
handling a permanent fault is incorporated into the
scheduling scheme, thus making the algorithm more
practical.
The rest of the paper is organized as follows. Section 2
briefly presents related work in the literature. Section 3
describes the workload and the system characteristics.
Section 4 proposes the eFRCD algorithm and the main
principles behind it, including theorems used for
presenting the algorithm. Performance evaluation is given
in Section 5. Section 6 concludes the paper by
summarizing the main contributions of this paper.
Proceedings of the International Conference on Parallel Processing (ICPP’02)
0-7695-1677-7/02 $17.00 © 2002 IEEE

2. Related work
The issue of scheduling on heterogeneous systems has
been studied in the literature in recent years. A scheduling
scheme, STDP, for heterogeneous systems was developed
in [16]. In [3,17], reliability cost was incorporated into
scheduling algorithms for tasks with precedence
constraints. However, these algorithms neither provide
fault-tolerance nor support real-time applications.
Previous work has been done to facilitate real-time
computing in heterogeneous systems. In [7], a solution for
the dynamic resource management problem in real-time
heterogeneous systems was proposed. These algorithms,
however, cannot tolerate any processor failure. Fault-
tolerance is considered in the design of real-time
scheduling algorithms to make systems more reliable.
In [6], a mechanism was proposed for supporting
adaptive fault-tolerance in a real-time system. Liberato et
al. proposed a feasibility-check algorithm for fault-
tolerant scheduling [8]. The well-known Rate-Monotonic
First-Fit assignment algorithm was extended in [2].
However, these algorithms assume that the underlying
system either is homogeneous or consists of a single
processor.
The algorithm in [1] is a real-time scheduling algorithm
for tasks with precedence constraints, but it does not
support fault-tolerance. Manimaran et al. [9] and Mosse et
al. [4] have proposed dynamic algorithms to schedule
real-time tasks with fault-tolerance requirements on
multiprocessor systems, but the tasks scheduled in their
algorithms are independent of one another and are
scheduled on-line. Martin [10] devised an algorithm on
the same system and task model as that in [4]. Oh and Son
studied a real-time and fault-tolerant scheduling algorithm
that statically schedules a set of independent tasks [12].
Two common features among these algorithms [4,8,11,
12] are that (1) tasks are independent from one another
and (2) they are designed only for homogeneous systems.
Although heterogeneous systems are considered in both
[17] and eFRCD, the latter considers fault-tolerance and
real-time tasks while the former does not consider either.
Very recently, Girault et al. proposed a real-time
scheduling algorithm for heterogeneous systems that
considers fault-tolerance and tasks with precedence
constraints [5]. This study is by far the closest to eFRCD
that the authors have found in the literature. The main
differences between [5] and eFRCD are three-fold: (a)
eFRCD considers heterogeneities in computation,
communication and reliability (defined shortly), whereas
the former considers only computational heterogeneity;
(b) the former does not take reliability cost into
consideration, whereas eFRCD is reliability-cost driven;
and (c) the former allows the concurrent execution of the
primary and backup copies of a task, while eFRCD allows
backup copies of tasks whose primary copies are
scheduled on different processors to overlap one another.
In the authors' previous work, both static [14,15] and
dynamic [13] scheduling schemes for heterogeneous real-time
systems were developed. One similarity among these
algorithms is that the Reliability Cost Driven scheme is
applied. With the exception of the FRCD algorithm [15],
the algorithms proposed in [13,14] cannot tolerate any
failure. In this paper, the FRCD algorithm [15] is
extended by relaxing the requirement that backup copies
of tasks not be allowed to overlap.
3. Workload and system characteristics
A real-time job with dependent tasks can be modelled by a
Directed Acyclic Graph (DAG), T = {V, E}, where
V = {v_1, v_2, ..., v_n} is a set of tasks and the set of edges E
represents communication among tasks. An edge
e_ij = (v_i, v_j) ∈ E indicates a message transmitted from task
v_i to v_j, and |e_ij| denotes the volume of data being sent. To
tolerate permanent faults in one processor, a primary-backup
technique is applied. Thus, each task has two copies,
namely v^P and v^B, executed sequentially on two different
processors. Without loss of generality, it is assumed that
the two copies of a task are identical; the proposed
approach also applies when the two copies of each task are different.
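The task-graph model above can be sketched directly in code. The following is a minimal illustration of the notation (V, E, |e_ij|, D(v)); the class names and example tasks are hypothetical, not from the paper.

```python
from dataclasses import dataclass, field

# Minimal sketch of the DAG model T = {V, E}. Class and variable names
# are illustrative; only the notation (V, E, |e_ij|, d(v)) follows the text.

@dataclass
class Task:
    name: str
    deadline: int                               # d(v)

@dataclass
class Job:
    tasks: dict = field(default_factory=dict)   # V: name -> Task
    edges: dict = field(default_factory=dict)   # E: (src, dst) -> |e_ij|

    def add_task(self, name, deadline):
        self.tasks[name] = Task(name, deadline)

    def add_edge(self, src, dst, volume):
        self.edges[(src, dst)] = volume         # |e_ij|: data volume

    def predecessors(self, name):               # D(v) = {v_i | (v_i, v) in E}
        return [s for (s, d) in self.edges if d == name]

job = Job()
job.add_task("v1", deadline=10)
job.add_task("v2", deadline=20)
job.add_edge("v1", "v2", volume=4)              # v1 sends 4 units to v2
print(job.predecessors("v2"))                   # ['v1']
```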
The heterogeneous system consists of a set
P = {p_1, p_2, ..., p_m} of heterogeneous processors connected by a
network. A processor communicates with other processors
through message passing. Computational heterogeneity is
modeled by a function C: V × P → Z^+, which represents
the execution time of each task on each processor; thus,
c_j(v_i) denotes the execution time of v_i on p_j.
Communicational heterogeneity is modeled by a function
M: E × P × P → Z^+. The communication time for sending a
message e_sr from v_s on p_i to v_r on p_j is determined by
w_ij × |e_sr|, where |e_sr| is the communication cost and w_ij
is the weight on the edge between p_i and p_j,
representing the delay involved in transmitting a message
of unit length between the two processors.
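As an illustration of the communication model, here is a sketch with a made-up weight matrix w; the per-unit delays are invented example values.

```python
# Communication time for a message e_sr from v_s on p_i to v_r on p_j is
# w_ij * |e_sr|. The weight matrix below is invented example data; the
# zero diagonal reflects that a message between co-located tasks needs
# no network transfer.

w = [
    [0, 2, 3],
    [2, 0, 1],
    [3, 1, 0],
]

def comm_time(i, j, volume):
    """Delay for a message of |e_sr| = volume units from p_i to p_j."""
    return w[i][j] * volume

print(comm_time(0, 1, 5))   # 2 * 5 = 10
print(comm_time(2, 2, 5))   # 0: both tasks on the same processor
```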
Given a task v ∈ V, let d(v), s(v) and f(v) denote its
deadline, scheduled start time, and finish time,
respectively, and let p(v) denote the processor to which v is
allocated. These parameters are subject to the constraints
f(v) = s(v) + c_i(v) and f(v) ≤ d(v), where p(v) = p_i. A real-time
job has a feasible schedule if, for all v ∈ V, it satisfies both
f(v^P) ≤ d(v) and f(v^B) ≤ d(v).
A k-timely-fault-tolerant (k-TFT) schedule is a
schedule in which no task deadlines are missed, despite k
arbitrary processor failures [12]. The goal of eFRCD
is to achieve 1-TFT.
The reliability cost of task v_i on p_j is defined as the
product of the failure rate λ_j of p_j and v_i's execution time on
p_j. It should be noted that reliability heterogeneity is
implied in the reliability cost, by virtue of the heterogeneity in
c_j(v_i) and λ_j. Let RC_0(R, Ψ) and RC_i(R, Ψ) (1 ≤ i ≤ m) be
the reliability cost when no processor fails and when p_i
fails, respectively, where Ψ is a given schedule and
R = {λ_1, λ_2, ..., λ_m} is the set of failure rates of the processors.
RC_0 and RC_i are determined by equations (1) and (2), respectively.

$$RC_0(R,\Psi) = \sum_{i=1}^{m} \; \sum_{p(v^P)=p_i} \lambda_i \, c_i(v) \qquad (1)$$

$$RC_i(R,\Psi) = \sum_{j=1,\, j \neq i}^{m} \; \sum_{p(v^P)=p_j} \lambda_j \, c_j(v) \;+\; \sum_{j=1,\, j \neq i}^{m} \; \sum_{p(v^B)=p_j,\; p(v^P)=p_i} \lambda_j \, c_j(v) \qquad (2)$$
In equation (2), the first summation term on the right
hand side represents the reliability cost due to tasks whose
primary copies reside in fault-free processors, while the
second summation term expresses the reliability cost due
to the backup copies of the tasks whose primary copies
reside in the failed processor.
Reliability, given in the following expression, captures
the ability of the system to complete parallel jobs in the
presence of a single permanent processor failure.

$$RL(R,\Psi) = e^{-RC(R,\Psi)} \qquad (3)$$
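A small numeric sketch of equations (1) and (3) for the fault-free case; the failure rates, execution times and schedule below are invented example values.

```python
import math

# Sketch of RC_0 (equation (1)) and RL = e^{-RC} (equation (3)).
# All numbers are invented example data.

failure_rate = {"p1": 1e-4, "p2": 5e-4}              # lambda_j
exec_time = {("v1", "p1"): 10, ("v2", "p2"): 8}      # c_j(v_i)
primary_of = {"v1": "p1", "v2": "p2"}                # p(v^P) under schedule Psi

def rc0():
    """RC_0: sum of lambda_j * c_j(v) over all scheduled primary copies."""
    return sum(failure_rate[p] * exec_time[(v, p)] for v, p in primary_of.items())

def reliability(rc):
    """RL(R, Psi) = e^{-RC(R, Psi)}."""
    return math.exp(-rc)

print(rc0())                 # 1e-4*10 + 5e-4*8 = 0.005
print(reliability(rc0()))    # e^{-0.005}, slightly below 1
```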
4. Scheduling algorithms
In this section, we present the eFRCD algorithm,
which has three objectives: (1) the total schedule
length is reduced so that more tasks can complete before
their deadlines; (2) permanent failures in one processor
can be tolerated; and (3) the system reliability is
enhanced by reducing the overall reliability cost of the
schedule.
4.1 An outline
The key to tolerating a single processor failure is to
allocate the primary and backup copies of a task to two
different processors, such that the backup copy
subsequently executes if the primary copy fails to
complete due to the failure of its processor. Not all backup
copies need to execute, even in the presence of a single
processor failure. Since only tasks allocated to the failed
processor are affected and need their backup copies to be
executed, certain backup copies can be scheduled to
overlap with one another. More precisely, a backup copy
v^B is allowed to overlap with other backup copies on the
same processor if their corresponding primary copies are
allocated to processors other than the one to which v^P is
allocated. Thus, in a feasible schedule, the primary copies
of any two tasks must not be allocated to the same
processor if their backup copies are on the same processor
and the two backup copies overlap.
This statement is formally described below.
Proposition 1. $\forall v_i, v_j \in V:\; p(v_i^B) = p(v_j^B) \wedge \big( s(v_i^B) \le s(v_j^B) < f(v_i^B) \;\vee\; s(v_j^B) \le s(v_i^B) < f(v_j^B) \big) \Rightarrow p(v_i^P) \neq p(v_j^P)$.
Fig. 1 shows an example illustrating this case. In this
example, v_i^P and v_j^P are allocated to p_1 and p_3,
respectively, and the backup copies of v_i and v_j are both
allocated to p_2. These two backup copies can be
overlapped with each other because at most one of them
will ever execute under the single-processor failure model.
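The overlapping rule of Proposition 1 can be checked mechanically. The following is a sketch, where each copy is represented by a hypothetical (processor, start, finish) tuple.

```python
# Proposition 1 as a predicate: two backup copies on the same processor
# may overlap in time only if their primary copies sit on different
# processors. Copies are (processor, start, finish) tuples (example format).

def intervals_overlap(a, b):
    return a[1] < b[2] and b[1] < a[2]

def overlap_allowed(backup_i, backup_j, primary_proc_i, primary_proc_j):
    if backup_i[0] != backup_j[0] or not intervals_overlap(backup_i, backup_j):
        return True                          # different processors or disjoint in time
    return primary_proc_i != primary_proc_j  # Proposition 1 condition

# The Fig. 1 scenario: primaries on p1 and p3, backups overlapping on p2.
print(overlap_allowed(("p2", 4, 9), ("p2", 6, 11), "p1", "p3"))  # True
# Same primary processor: the overlap would violate Proposition 1.
print(overlap_allowed(("p2", 4, 9), ("p2", 6, 11), "p1", "p1"))  # False
```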
The algorithm schedules tasks in three main steps.
First, tasks are ordered by their deadlines in
non-decreasing order, such that tasks with tighter
deadlines have higher priorities. Second, the primary
copies are scheduled. Finally, the backup copies are
scheduled in a similar manner to the primary copies,
except that they may be overlapped on the same
processors to reduce the schedule length. More specifically,
in the second and third steps, the scheduling of each task
must satisfy the following three conditions: (1) its
deadline should be met; (2) the processor allocation
should lead to the minimum increase in overall reliability
cost among all processors satisfying condition (1); and (3)
it should be able to receive messages from all its
predecessors. In addition to these conditions, each backup
copy must satisfy three extra conditions: (i) it is
allocated to a processor different from the one assigned to
its primary copy; (ii) its start time is later than the finish
time of its primary copy plus the fault detection time δ;
and (iii) it is allowed to overlap with other backup copies
on the same processor only if their primary copies are
allocated to different processors. Conditions (i) and (ii)
can be formally described by the following proposition.
Proposition 2. A schedule is 1-TFT if $\forall v \in V:\; p(v^P) \neq p(v^B) \wedge s(v^B) \geq f(v^P) + \delta$.
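Conditions (i) and (ii) can likewise be expressed as a per-task check; δ and the times below are example values.

```python
# Proposition 2 per task: the backup copy runs on a different processor
# than the primary, and cannot start until the primary's finish time
# plus the fault detection/handling time delta. Values are examples.

DELTA = 2   # fault detection time delta (illustrative)

def satisfies_prop2(primary_proc, primary_finish, backup_proc, backup_start,
                    delta=DELTA):
    return primary_proc != backup_proc and backup_start >= primary_finish + delta

print(satisfies_prop2("p1", 10, "p2", 12))  # True: 12 >= 10 + 2
print(satisfies_prop2("p1", 10, "p2", 11))  # False: backup starts too early
print(satisfies_prop2("p1", 10, "p1", 15))  # False: same processor
```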
4.2 The eFRCD algorithm
To facilitate the presentation of the algorithm,
necessary notations are listed in the following table.
Table 1. Definitions of notation

D(v) — The set of predecessors of task v: D(v) = {v_i | (v_i, v) ∈ E}.
S(v) — The set of successors of task v: S(v) = {v_i | (v, v_i) ∈ E}.
F(v) — The set of feasible processors to which v^B can be allocated, determined in part by Theorem 2.
B(v) — The set of predecessors of v's backup copy, determined by Expression (7).
VQ_i — The queue of all tasks scheduled on p_i, with s(v_{q+1}) = ∞ and f(v_0) = 0.
VQ_i'(v) — The queue of all tasks scheduled on p_i that cannot overlap with the backup copy of task v, with s(v_{q+1}) = ∞ and f(v_0) = 0.
v_i ≺_f v_j — v_i is schedule-preceding v_j if and only if s(v_j) ≥ f(v_i).
v_i → v_j — v_i is message-preceding v_j if and only if v_i sends a message to v_j. Note that v_i → v_j implies v_i ≺_f v_j, but not inversely.
v_i →_a v_j — v_i is execution-preceding v_j if and only if both tasks execute and v_i → v_j. Note that v_i →_a v_j implies v_i → v_j and v_i ≺_f v_j.
EAT_i^P(v) — The earliest available time on p_i for v^P.
EAT_i^B(v) — The earliest available time on p_i for v^B.
EST_i^P(v) — The earliest start time for v^P on processor p_i.
EST_i^B(v) — The earliest start time for v^B on processor p_i.

Fig. 1 Primary copies of v_i and v_j are allocated to p_1 and p_3, respectively, and backup copies of v_i and v_j are both allocated to p_2; the two backup copies overlap on p_2. [figure omitted]
A detailed pseudocode of the eFRCD algorithm is
presented below.

The eFRCD Algorithm:
Input: T = {V, E}, P, C, M, R  /* DAG, distributed system; computational, communicational and reliability heterogeneity */
Output: Schedule feasibility of T, and a viable schedule Ψ if it is feasible.

1. Sort tasks by their deadlines in non-decreasing order, subject to
   precedence constraints, and generate an ordered list OL;
2. /* Schedule primary copies of tasks */
   for each task v in OL, following the order, do  /* schedule v^P */
      2.1 s(v^P) ← ∞; rc ← ∞; VQ_i = ∅;
      2.2 for each processor p_i do  /* check if v can be allocated to p_i */
          /* Calculate EST_i^P(v), where VQ_i = {v_1, v_2, ..., v_q} is the queue of */
          /* all tasks scheduled on p_i, s(v_{q+1}) = ∞, and f(v_0) = 0 */
          2.2.1 /* Compute the earliest start time of v on p_i */
                for (j = 0 to q+1) do
                   /* check if an unoccupied time interval, interspersed among */
                   /* currently scheduled tasks, can accommodate v */
                   if s(v_{j+1}) − MAX{f(v_j), EAT_i^P(v)} ≥ c_i(v) then
                      EST_i^P(v) = MAX{f(v_j), EAT_i^P(v)}; break;
                end for
          2.2.2 /* Determine the earliest EST_i based on Equation (6) */
                if v^P, starting at EST_i^P(v), can be completed before d(v) then
                   Determine the reliability cost rc_i of v^P on p_i;
                   /* Find the minimum reliability cost */
                   if (rc_i < rc) or (rc_i = rc and EST_i^P(v) < s(v^P)) then
                      s(v^P) ← EST_i^P(v); p ← p_i; rc ← rc_i;
          end for
      2.3 if no proper processor is available for v^P then return(FAIL);
      2.4 Assign p to v, where the reliability cost of v^P on p is minimal;
          VQ_i ← VQ_i + v^P;
      2.5 Update information of messages;
   end for
3. /* Schedule backup copies of tasks */
   for each task v in the ordered list do  /* schedule the backup copy v^B */
      3.1 s(v^B) ← ∞; rc ← ∞;
      /* Determine whether v^B should be allocated to processor p_i */
      3.2 for each feasible processor p_i ∈ F(v), subject to Proposition 2 and
          Theorem 2, do
          3.2.1 /* identify tasks already scheduled on p_i that cannot overlap with v^B */
                for (v_j ∈ VQ_i) do
                   if (v_j is a primary copy) or ((v_j is a backup copy) and
                      (p(v_j^P) = p(v^P))) then  /* subject to Proposition 1 */
                      copy v_j into task queue VQ_i'(v);
          3.2.2 Determine whether v^P is a strong primary copy (using Theorem 4);
          3.2.3 /* check whether an unoccupied time interval, or a time slot occupied */
                /* by backup copies that can overlap with v^B, can accommodate v^B */
                for (all v_j in task queue VQ_i'(v)) do
                   if s(v_{j+1}) − MAX{f(v_j), EAT_i^B(v)} ≥ c_i(v) then
                      EST_i^B(v) = MAX{f(v_j), EAT_i^B(v)}; break;
                end for
          3.2.4 /* Determine the earliest EST_i based on Equation (9) */
                if v, starting at EST_i^B(v), can be completed before d(v) then
                   Determine the reliability cost rc_i of v^B on p_i;
                   /* Find the minimum rc */
                   if (rc_i < rc) or (rc_i = rc and EST_i^B(v) < s(v^B)) then
                      s(v^B) ← EST_i^B(v); p ← p_i; rc ← rc_i;
                end if
          end for
      3.3 if no proper processor is available for v^B then return(FAIL);
      3.4 Find and assign p ∈ F(v) to v, where the reliability cost of v^B on p
          is minimal; VQ_i ← VQ_i + v^B;
      3.5 Update information of messages;
      3.6 for each task v_j ∈ B(v) do  /* avoid redundant messages */
             v_j sends a message to v^B if possible (based on Theorem 1 and
             Expression (7));
      3.7 for each task v_j ∈ S(v) do  /* avoid redundant messages */
             if p(v^P) ≠ p(v_j^P) or v^P is not a strong primary copy then
                v^B sends a message to v_j^P if possible (based on Theorem 3);
   end for
   return (SUCCEED);
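The processor-selection logic of step 2.2 (deadline first, then minimum reliability cost, with earlier start time as tie-breaker) can be condensed into a few lines. This is a simplified sketch with invented inputs standing in for EST_i^P(v), c_i(v) and λ_i; it ignores precedence and message constraints.

```python
# Simplified core of step 2.2 of eFRCD for a single primary copy:
# among processors where the copy meets its deadline, choose the one
# with the minimum reliability cost lambda_i * c_i(v), breaking ties
# by the earlier start time. Returns None on FAIL.

def pick_processor(est, exec_time, lam, deadline):
    best = None                              # (rc_i, EST_i, processor index)
    for i in range(len(est)):
        if est[i] + exec_time[i] > deadline:
            continue                         # condition (1): deadline missed
        rc_i = lam[i] * exec_time[i]         # reliability cost on p_i
        cand = (rc_i, est[i], i)
        if best is None or cand[:2] < best[:2]:
            best = cand                      # condition (2): minimum rc, tie on EST
    return best

# Three processors (failure rates scaled to integers for readability);
# the processor at index 1 has the lowest reliability cost.
print(pick_processor(est=[0, 3, 1], exec_time=[10, 6, 8],
                     lam=[20, 5, 9], deadline=12))   # (30, 3, 1)
print(pick_processor(est=[0], exec_time=[10], lam=[1], deadline=5))  # None
```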
4.3 The scheduling principles
Recall that EST(v) and EAT(v) are important for
determining a proper schedule for a given task v. While both
EAT and EST indicate a time when all messages from v's
predecessors have arrived, EST additionally signifies that
the processor to which v is allocated is available for v
to start execution. In the following, we present a series of
derivations that lead to the final expressions for EAT(v)
and EST(v).
If only one of v's predecessors, v_j ∈ D(v), is considered,
then the earliest available time EAT_i(v, v_j) for the primary/
backup copies of task v depends on the finish time f(v_j),
the earliest message start time MST_ik(e), and the
transmission time w_ik × |e| of the message e sent from v_j to v,
where p_k = p(v_j). Thus,

$$EAT_i(v, v_j) = \begin{cases} f(v_j) & \text{if } p_i = p_k \\ MST_{ik}(e) + w_{ik} \cdot |e| & \text{otherwise} \end{cases} \qquad (4)$$

Now consider all predecessors of v. Clearly, v must wait
until the last message from all its predecessors has
arrived. Thus the earliest available time for v^P on p_i,
EAT_i^P(v), is the maximum of EAT_i(v, v_j) over all the
predecessors:

$$EAT_i^P(v) = \max_{v_j \in D(v)} \left\{ EAT_i(v, v_j^P) \right\} \qquad (5)$$
Based on expression (5), EST_i^P(v) on p_i can be
computed by checking the queue VQ_i to find out whether the
processor has an idle time slot that starts later than the task's
EAT_i^P(v) and is large enough to accommodate the task.
This procedure is described in step 2.2.1 of the algorithm.
EST_i^P(v) is applied to derive EST^P(v), the earliest start
time for v^P on any processor. The expression for EST^P(v) is
given below.

$$EST^P(v) = \min_{p_i \in P''} \left\{ EST_i^P(v) \right\} \qquad (6)$$

where $P'' = \{ p_i \in P' \mid \lambda_i \times c_i(v) = \min_{p_j \in P'} \{ \lambda_j \times c_j(v) \} \}$,
and $P' = \{ p_i \in P \mid EST_i^P(v) + c_i(v) < d(v) \}$.
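Equations (4) and (5) translate directly into code. Below is a sketch with invented predecessor data; each predecessor record carries its processor, finish time, message start time MST, link weight w and message volume.

```python
# Sketch of equations (4) and (5): the earliest available time of v on
# p_i is the latest, over all predecessors v_j, of either f(v_j) (when
# v_j runs on p_i) or the message arrival time MST + w * |e|.
# The predecessor records below are invented example data.

def eat_single(p_i, pred):
    """Equation (4): EAT_i(v, v_j) for one predecessor record."""
    if pred["proc"] == p_i:
        return pred["f"]                          # message stays local
    return pred["mst"] + pred["w"] * pred["vol"]  # remote transfer delay

def eat_primary(p_i, preds):
    """Equation (5): EAT_i^P(v) = max over all predecessors."""
    return max(eat_single(p_i, d) for d in preds)

preds = [
    {"proc": "p1", "f": 8, "mst": 9, "w": 2, "vol": 3},   # remote: 9 + 6 = 15
    {"proc": "p2", "f": 12, "mst": 0, "w": 0, "vol": 0},  # local on p2: 12
]
print(eat_primary("p2", preds))   # 15
```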
EST^B(v), the earliest start time for v^B, is computed in a
more complex way than EST^P(v). This is because the set
of predecessors of v^P, D^P(v), contains exclusively the
primary copies of v's predecessor tasks, whereas the set of
predecessors of v^B, B(v), may contain a certain
combination of the primary and backup copies of v's
predecessor tasks. In order to decide B(v), it is necessary
to introduce the notion of a strong primary copy, as follows.
Note that there are two cases in which v^P may fail to
execute: (1) p(v^P) fails before time f(v^P), and (2) v^P fails to
receive messages from all its predecessors. Case (2) is
illustrated by a simple example in Fig. 2, where dotted
lines denote messages sent from predecessors to
successors. Let v_j be a predecessor of v, with p(v) ≠ p(v_j).
Suppose that at some time t < f(v_j^P), p(v_j^P) fails; then v_j^B should
execute. If v_j^B is not schedule-preceding v^P, then v^P cannot
receive any message from v_j^B. Hence, even if p(v^P)
does not fail, v^P still cannot execute. The primary copy of
a task that never encounters case (2) is referred to as a
strong primary copy, as formally defined below.
Definition 1. Given a task v, v^P is a strong primary copy
if and only if the execution of v^B implies the failure of
p(v^P) before time f(v^P). Equivalently, v^P is a strong
primary copy if and only if the absence of a failure of
p(v^P) by time f(v^P) implies the execution of v^P.
Recall the assumption that only one processor may
encounter a permanent failure. We observe that if v_i is
a predecessor of v_j, and the primary copies of both tasks
are strong primary copies, then v_i^B is not message-preceding
v_j^B. Fig. 3 illustrates a scenario of this case,
which is presented formally in Theorem 1; the theorem is
helpful in determining the set of predecessors of a
backup copy (see step 3.6).
Theorem 1. Given two tasks v_i and v_j, where v_i is a predecessor
of v_j: v_i^B is not message-preceding v_j^B, meaning that v_i^B
does not need to send a message to v_j^B, if v_i^P and v_j^P are both
strong primary copies and p(v_i^P) ≠ p(v_j^P).
Proof: Since v_i^P and v_j^P are both strong primary copies,
according to Definition 1, v_i^B and v_j^B can both execute if
and only if both v_i^P and v_j^P have failed to execute due to
processor failures. But v_i^P and v_j^P are allocated to two
different processors, an impossibility under the single-failure
assumption. Thus, at least one of v_i^B and v_j^B will not execute,
implying that no message needs to be sent from v_i^B to v_j^B.
Let B(v) ⊆ V be the set of predecessors of v^B. It is
defined as follows:

B(v) = {v_i^P | v_i ∈ D(v)} ∪ {v_i^B | v_i ∈ D(v) ∧ (v_i^P is not a strong primary copy ∨ v^P is not a strong primary copy ∨ p(v_i^P) = p(v^P))} = D^P(v) ∪ D^B(v)   (7)
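Expression (7) amounts to a small set computation. Below is a sketch with hypothetical predecessor records (name, processor of the primary copy, whether that primary is strong).

```python
# Sketch of Expression (7): B(v) always contains every predecessor's
# primary copy (D^P(v)); a predecessor's backup copy joins B(v) only if
# v_i^P is not strong, or v^P is not strong, or p(v_i^P) = p(v^P).

def backup_predecessors(preds, v_primary_proc, v_primary_strong):
    """preds: list of (name, primary_proc, primary_is_strong) tuples."""
    b = [(name, "P") for name, _, _ in preds]        # D^P(v)
    for name, proc, strong in preds:                 # D^B(v)
        if (not strong) or (not v_primary_strong) or proc == v_primary_proc:
            b.append((name, "B"))
    return b

preds = [("v1", "p1", True), ("v2", "p3", False)]
# v^P is strong and runs on p1: v1^B joins because p(v1^P) = p(v^P);
# v2^B joins because v2^P is not a strong primary copy.
print(backup_predecessors(preds, "p1", True))
# [('v1', 'P'), ('v2', 'P'), ('v1', 'B'), ('v2', 'B')]
```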
In the eFRCD algorithm, the primary copy of a task is allocated
before its corresponding backup copy is scheduled.
Hence, given a task v and its predecessor v_i ∈ D(v), both
copies of v_i have already been allocated when the
algorithm starts scheduling v^B. Obviously, v^B must receive
Fig. 2 Since processor p_1 fails, v_i^B executes. Because v_j^P cannot receive a message from v_i^B, v_j^B must execute instead of v_j^P. [figure omitted]

Fig. 3 (v_i, v_j) ∈ E; v_i^P and v_j^P are both strong primary copies, and they are scheduled on two different processors. v_i^B is not execution-preceding v_j^B. [figure omitted]

Fig. 4 (v_i, v_j) ∈ E; v_i^B is not schedule-preceding v_j^P, and v_i^P is a strong primary copy. v_j^B cannot be scheduled on the processor on which v_i^P is scheduled. [figure omitted]

Fig. 5 v_i is the predecessor of v_j; v_i^P and v_j^P are scheduled on the same processor, and v_i^P is the strong primary copy. In this case, v_i^B is not execution-preceding v_j^P. [figure omitted]
