Failure diagnosis using discrete-event models

doi:10.1109/87.486338

IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY, VOL. 4, NO. 2, MARCH 1996, p. 105

Failure Diagnosis Using Discrete-Event Models

Meera Sampath, Student Member, IEEE, Raja Sengupta, Stephane Lafortune, Member, IEEE, Kasim Sinnamohideen, Member, IEEE, and Demosthenis C. Teneketzis, Member, IEEE

Abstract: Detection and isolation of failures in large, complex systems is a crucial and challenging task. The increasingly stringent requirements on performance and reliability of complex technological systems have necessitated the development of sophisticated and systematic methods for the timely and accurate diagnosis of system failures. We propose a discrete-event systems (DES) approach to the failure diagnosis problem. This approach is applicable to systems that fall naturally in the class of DES; moreover, for the purpose of diagnosis, continuous-variable dynamic systems can often be viewed as DES at a higher level of abstraction. We present a methodology for modeling physical systems in a DES framework and illustrate this method with examples. We discuss the notion of diagnosability, the construction procedure of the diagnoser, and necessary and sufficient conditions for diagnosability. Finally, we illustrate our approach using realistic models of two different heating, ventilation, and air conditioning (HVAC) systems, one diagnosable and the other not diagnosable. While the modeling methodology presented here has been developed for the purpose of failure diagnosis, its scope is not restricted to this problem; it can also be used to develop DES models for other purposes such as control. A detailed treatment of the theory underlying our approach can be found in a companion paper [27].

I. INTRODUCTION

DETECTION and isolation of failures in large, complex systems is a crucial and challenging task. Most practical systems employ some means of fault detection, the simplest of such schemes involving threshold logic, alarms, and warning systems. The increasingly stringent requirements on performance and reliability of complex technological systems, however, have necessitated the development of sophisticated and systematic methods for the timely and accurate diagnosis of system failures. The problem of failure diagnosis has received considerable attention in the literature of reliability engineering, control, and computer science and a wide variety of schemes have been proposed. Failure diagnosis using fault trees has been studied in detail by reliability engineers [17], [16], [32], [8], [34]. Quantitative, analytical-model-based methods have been extensively studied by control systems researchers (see [10], [33] and [35] and references therein; also see [3] and [31]) while expert systems and model-based reasoning schemes for diagnosis have been proposed by computer scientists (see, e.g., [5], [20], [9], [7], [6], [22], [11], [23], and [26]). A detailed discussion of several of these methods has appeared in [24]. For a brief overview of the salient features of the aforementioned methods, see [28]. Recently, the problem of failure diagnosis has also been studied in the framework of discrete-event systems (DES) [4], [14], [18], [19], [29], [34]. In [18] and [19], the authors propose a state-based approach to diagnosability; they study the problems of off-line diagnosis and on-line diagnosis where the basic idea of the diagnostic procedure is to "test and observe." Extensions of the above work can be found in [4] where the authors study testability of DES. In [14], the authors present a template monitoring scheme based on timing and sequencing relationships of events for fault monitoring in manufacturing systems. In [34], the authors propose a Petri net based method for failure diagnosis of manufacturing systems which uses Petri net models for failure detection and fault trees for failure isolation.

Manuscript received May 16, 1994. Recommended by Associate Editor, X. Cao. This work was supported in part by NSF Grants ECS-9057967, ECS-9312134, and ECS-9204419, with additional support from DEC and GE.

M. Sampath, R. Sengupta, S. Lafortune, and D. Teneketzis are with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109-2122 USA.

K. Sinnamohideen is with Johnson Controls, Inc., Milwaukee, WI 53201 USA.

Publisher Item Identifier S 1063-6536(96)02070-2.

We propose in this paper and in the companion paper [27] a DES approach to the failure diagnosis problem that expands on the work in [29] and is different from the DES-based approaches mentioned above. DES are characterized by a discrete-state space of logical values and event-driven dynamics. Most large scale dynamic systems can be viewed as DES at some level of abstraction. Hence, the proposed method of fault diagnosis is applicable not only to systems that fall naturally in the class of DES (communication networks and computer systems, for instance), but also to systems traditionally treated as continuous variable dynamic systems and modeled by differential equations. One of the major advantages of the proposed method is that it does not require detailed in-depth modeling of the system to be diagnosed and hence is ideally suited for the diagnosis of large complex systems like heating, ventilation and air conditioning (HVAC) units, power plants, and semiconductor processes. Other application areas include automated manufacturing systems like automobile manufacturing where systematic diagnostic procedures are necessary to check equipment integrity before they leave the production line. Fig. 1 illustrates the overall system architecture which contains in it a DES-based diagnostic subsystem. We assume a two-level system architecture. At the lower level is the system itself with its set of controllers; these low-level controllers typically consist of equipment controllers and multivariable controllers. The upper level consists of the supervisor, which performs the tasks of control and coordination of the low-level controllers, failure diagnosis, failure recovery/system reconfiguration following failure identification, and coordination of all of these subsystem operations. The interface between the two layers conveys information on occurrences of observable events in the system to the supervisor and communicates the commands issued by the supervisor to the system.

Fig. 1. The conceptual system architecture. [Figure: block diagram; recoverable labels: SUPERVISOR, DIAGNOSER, COORDINATION, INTERFACE, CONTROLLER(S), Commands, Observable events, Type of Failure.]

1063-6536/96$05.00 © 1996 IEEE

Our approach to failure diagnosis involves two major steps: developing a discrete-event model of the system to be diagnosed followed by construction of the diagnoser. The discrete-event model that we develop captures both the normal and the failed behavior of the system. The failures are modeled as unobservable events and the objective is to infer about past occurrences of these failures on the basis of the observed events. The diagnoser is a finite-state machine (FSM) built from the system model. This machine performs diagnosis when it observes on-line the behavior of the system. The diagnoser provides estimates of the state of the system after the occurrence of every observable event. In addition, states of the diagnoser carry failure information and occurrences of failures can be detected (with a finite delay) by inspecting these states. Fig. 2 illustrates the basic paradigm of our approach. The top part of this figure shows the various steps involved in failure diagnosis; all these steps are to be performed by the diagnoser, as shown in the bottom part of Fig. 2. This approach to diagnosis is appropriate for failures that involve significant changes in the status of system components but do not necessarily bring the system to a halt.

Fig. 2. The diagnostic process. [Figure: flow diagram; recoverable labels: System Model and Observations, Observer, Estimate of Current System State, Inferencing About Past Failure Events, Potential Past Failures, Failure Identification, Message to Coordinator.]
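The idea of inferring unobservable failure events from observed events can be illustrated with a small state estimator. The following Python sketch is only an illustration under stated assumptions: it is not the diagnoser construction of [27], it tracks a single Boolean "failure has occurred" flag rather than failure types, and the toy model, the event names `a`, `b`, `c`, `f`, and the function names are all hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

State = str
Event = str
Delta = Dict[Tuple[State, Event], State]    # partial transition function
Estimate = FrozenSet[Tuple[State, bool]]    # (state, failure-has-occurred flag)


def unobservable_reach(pairs: Iterable[Tuple[State, bool]], delta: Delta,
                       observable: Set[Event], failures: Set[Event]) -> Set[Tuple[State, bool]]:
    """Close a set of (state, flag) pairs under unobservable transitions."""
    seen = set(pairs)
    stack = list(seen)
    while stack:
        x, failed = stack.pop()
        for (src, ev), dst in delta.items():
            if src == x and ev not in observable:
                nxt = (dst, failed or ev in failures)
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return seen


def update_estimate(estimate: Estimate, observed: Event, delta: Delta,
                    observable: Set[Event], failures: Set[Event]) -> Estimate:
    """Update the estimate after one observable event has been seen."""
    reach = unobservable_reach(estimate, delta, observable, failures)
    return frozenset((delta[(x, observed)], failed)
                     for x, failed in reach if (x, observed) in delta)


def verdict(estimate: Estimate) -> str:
    """Summarize the failure flags carried by the current estimate."""
    flags = {failed for _, failed in estimate}
    if flags == {True}:
        return "failure"
    if flags == {False}:
        return "normal"
    return "uncertain"


# Hypothetical toy model: 'f' is an unobservable failure event.
DELTA: Delta = {("0", "a"): "1", ("0", "f"): "2", ("2", "a"): "3",
                ("1", "b"): "1", ("3", "c"): "3"}
OBSERVABLE = {"a", "b", "c"}
FAILURES = {"f"}


def run(observations: Iterable[Event]) -> Estimate:
    """Feed a sequence of observed events to the estimator from state '0'."""
    estimate: Estimate = frozenset({("0", False)})
    for ev in observations:
        estimate = update_estimate(estimate, ev, DELTA, OBSERVABLE, FAILURES)
    return estimate
```

In this toy model the observation `a` alone leaves the estimator uncertain, because `a` can occur with or without a preceding `f`; a subsequent `c` is possible only after the failure, which is the sense in which a failure is detected with a finite delay.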

One of the main contributions of this paper is a precise methodology for modeling physical systems in a DES framework. The system is assumed to consist of several distinct physical components and equipped with a set of sensors. Starting from discrete-event models of the individual components and from the discrete-valued sensor maps, we present a systematic procedure for generating a composite model which captures the interaction among the components and also incorporates in it the sensor maps. This composite model is the DES on which we perform diagnostics. While this approach to modeling has been developed for the purpose of diagnostics, its scope is not restricted to this problem; the model building methodology presented here can be used to develop DES models of any real system for other purposes such as control.

Aside from the modeling methodology, the rest of the theoretical developments underlying our approach to failure diagnosis are presented in [27]. In [27] we introduce two related notions of diagnosability of a language generated by a DES. The first definition, referred to as diagnosability, is more stringent than the second one, which we refer to as I-diagnosability. Roughly speaking, a system is said to be diagnosable if it is possible to detect, with finite delay, occurrences of certain specific unobservable events, namely, the failure events. In [27] we present a formal construction procedure of the diagnoser followed by necessary and sufficient conditions for diagnosability and I-diagnosability. These conditions are stated on the diagnoser or variations thereof. Thus, the diagnoser serves two purposes: 1) on-line detection and isolation of failures and 2) off-line verification of the diagnosability properties of the system.

In this paper, we restrict our attention to the notion of I-diagnosability introduced in [27]. Section II describes, with illustrative examples, model building for diagnosis. In Section III, we present some of the main results of [27]; we review the notion of I-diagnosability, the construction of diagnosers, and the necessary and sufficient conditions for I-diagnosability. Next, we illustrate our approach to failure diagnosis with two examples of HVAC systems. The DES models of these systems, the corresponding diagnosers and their analysis are presented in Section IV. In Section V, we provide a brief comparison of the proposed method with some of the other approaches to failure diagnosis mentioned earlier. Finally, in Section VI we summarize the main results of this paper.

II. MODEL BUILDING FOR DIAGNOSIS

Suppose that the system to be diagnosed has $N$ individual components; typically, these components consist of equipment and controllers. We first build DES models for these components. Let

$$G_i = (X_i, \Sigma_i, \delta_i, x_{0i}) \tag{1}$$

refer to the FSM (see, e.g., [25]) model of the $i$th component; here $X_i$ is the state space, $\Sigma_i$ is the event set, $\delta_i$ is the transition function, and $x_{0i}$ is the initial state of $G_i$. The states in $X_i$ and the events in $\Sigma_i$ reflect the normal and the failed behavior of the $i$th component. Some of the events in $\Sigma_i$ are observable, i.e., their occurrence can be observed, while the rest are unobservable. Typically, the observable events include commands issued by the supervisor while the unobservable events include failure events.

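Next, we compose these individual models using the standard synchronous composition operation on state machines (see, e.g., [15]). The synchronous composition procedure, recalled below, is used to model the joint operation of two or more DES given their individual FSM models. Consider two discrete-event systems $G_1 = (X_1, \Sigma_1, \delta_1, x_{01})$ and $G_2 = (X_2, \Sigma_2, \delta_2, x_{02})$. We denote by $e_i(x)$ the active event set of $G_i$ at state $x$, i.e., the set of all events for which a transition of $G_i$ is defined at state $x$. Let $G = (X, \Sigma, \delta, x_0)$ denote the synchronous composition of $G_1$ and $G_2$. Then

$$
\begin{aligned}
\Sigma &= \Sigma_1 \cup \Sigma_2 \\
X &= X_1 \times X_2 \\
x_0 &= (x_{01}, x_{02}) \\
\delta((x_1, x_2), \sigma) &=
\begin{cases}
(\delta_1(x_1, \sigma), \delta_2(x_2, \sigma)) & \text{if } \sigma \in e_1(x_1) \cap e_2(x_2) \\
(\delta_1(x_1, \sigma), x_2) & \text{if } \sigma \in e_1(x_1) \text{ and } \sigma \notin \Sigma_2 \\
(x_1, \delta_2(x_2, \sigma)) & \text{if } \sigma \in e_2(x_2) \text{ and } \sigma \notin \Sigma_1 \\
\text{undefined} & \text{otherwise.}
\end{cases}
\end{aligned}
\tag{2}
$$

Thus an event $\sigma$ which is common to both $G_1$ and $G_2$ is possible at state $(x_1, x_2)$ of $G$ only if $\sigma$ is in the active event set of $G_1$ at $x_1$ and in the active event set of $G_2$ at $x_2$. In this case, both systems $G_1$ and $G_2$ are assumed to execute $\sigma$. On the other hand, if $\sigma$ is an event possible in $G_1$ ($G_2$) and it is not in $\Sigma_2$ ($\Sigma_1$), then only $G_1$ ($G_2$) executes the transition $\sigma$. It is not difficult to see that the synchronous composition procedure described above can be extended to model the joint operation of any number of DES.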
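The composition rule described above can be sketched in Python as follows. This is a minimal illustration, not code from the paper: the `FSM` container, the function names, and the two toy machines (states `p0`, `p1`, `q0`, `q1`, private events `a` and `b`, shared event `s`) are assumptions made for the example, and the sketch builds only the accessible part of the product.

```python
from typing import Dict, FrozenSet, Hashable, NamedTuple, Set, Tuple


class FSM(NamedTuple):
    states: FrozenSet[Hashable]
    events: FrozenSet[str]
    delta: Dict[Tuple[Hashable, str], Hashable]   # partial transition function
    init: Hashable


def active(g: FSM, x: Hashable) -> Set[str]:
    """Active event set of g at state x: events with a transition defined at x."""
    return {ev for (src, ev) in g.delta if src == x}


def sync_compose(g1: FSM, g2: FSM) -> FSM:
    """Synchronous composition of g1 and g2, restricted to accessible states."""
    init = (g1.init, g2.init)
    events = g1.events | g2.events
    delta: Dict[Tuple[Hashable, str], Hashable] = {}
    seen = {init}
    stack = [init]
    while stack:
        x1, x2 = stack.pop()
        e1, e2 = active(g1, x1), active(g2, x2)
        for ev in sorted(events):
            if ev in e1 and ev in e2:
                # Common event enabled in both: both components move.
                nxt = (g1.delta[(x1, ev)], g2.delta[(x2, ev)])
            elif ev in e1 and ev not in g2.events:
                # Private event of g1: only g1 moves.
                nxt = (g1.delta[(x1, ev)], x2)
            elif ev in e2 and ev not in g1.events:
                # Private event of g2: only g2 moves.
                nxt = (x1, g2.delta[(x2, ev)])
            else:
                continue   # event not possible at (x1, x2)
            delta[((x1, x2), ev)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return FSM(frozenset(seen), events, delta, init)


# Hypothetical toy components sharing the event 's'.
G1 = FSM(frozenset({"p0", "p1"}), frozenset({"a", "s"}),
         {("p0", "a"): "p1", ("p1", "s"): "p0"}, "p0")
G2 = FSM(frozenset({"q0", "q1"}), frozenset({"b", "s"}),
         {("q0", "b"): "q1", ("q1", "s"): "q0"}, "q0")
G = sync_compose(G1, G2)
```

In the toy example the shared event `s` is blocked at `("p1", "q0")` because the second component cannot execute it there, while the private events `a` and `b` interleave freely.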

Let

$$\tilde{G} = (\tilde{X}, \tilde{\Sigma}, \tilde{\delta}, \tilde{x}_0)$$

denote the synchronous composition of the component models $G_i$, $i = 1, \ldots, N$. Observe that we need only consider the accessible part of $\tilde{G}$. $\tilde{G}$ then models the joint operation of these components. Here

$$\tilde{\Sigma} = \bigcup_{i=1}^{N} \Sigma_i. \tag{3}$$

Given the set of $M$ sensors of the system of interest, we next identify the sensor maps $h_j : \tilde{X} \to Y_j$, $j = 1, \ldots, M$, where $Y_j$ denotes the discrete set of possible outputs of the $j$th sensor. Define

$$Y = \prod_{j=1}^{M} Y_j \tag{4}$$

and let $h : \tilde{X} \to Y$ denote the global sensor map defined as follows:

$$h(x) = (h_1(x), h_2(x), \ldots, h_M(x)). \tag{5}$$

Finally, we transform $\tilde{G} = (\tilde{X}, \tilde{\Sigma}, \tilde{\delta}, \tilde{x}_0)$ to $G = (X, \Sigma, \delta, x_0)$ with $x_0 = \tilde{x}_0$ by redefining the transitions of $\tilde{G}$ as follows. Let $\tilde{\delta}(x, \sigma) = x'$ where $x, x' \in \tilde{X}$ and $\sigma \in \tilde{\Sigma}$.

1) If $\sigma$ is observable (typically a command event), then rename $\sigma$ in the transition as $\langle \sigma, h(x') \rangle$ and let $\delta(x, \langle \sigma, h(x') \rangle) = x'$. The new event $\langle \sigma, h(x') \rangle$ is observable in $\Sigma$.

2) If $\sigma$ is unobservable and if $h(x) = h(x')$, then $\sigma$ is left unchanged in $G$ and $\delta(x, \sigma) = x'$. The event $\sigma$ is treated as unobservable in $\Sigma$.

3) If $\sigma$ is unobservable and if $h(x) \neq h(x')$, then replace the transition $\tilde{\delta}(x, \sigma) = x'$ by the following two transitions:

a) $\delta(x, \sigma) = x_{\mathrm{new}}$, and

b) $\delta(x_{\mathrm{new}}, \langle h(x) \to h(x') \rangle) = x'$,

where $x_{\mathrm{new}}$ denotes a newly introduced state and $\langle h(x) \to h(x') \rangle$ denotes the change in sensor readings corresponding to states $x$ and $x'$. The first transition $\sigma$ is unobservable in $\Sigma$ while the second $\langle h(x) \to h(x') \rangle$ is observable.

Fig. 3. Component models for Example 2.1. [Figure: state-transition diagrams of the VALVE, PUMP, and CONTROLLER; recoverable event labels include OPEN-VALVE, CLOSE-VALVE, START-PUMP, STOP-PUMP, and the pump failure events.]

For the purpose of clarity, we henceforth denote all events in the composite model $G$ within angle braces, $\langle \cdots \rangle$. Therefore the event set $\Sigma$ of $G$ consists of composite events of the following three types:

1) $\langle \sigma, h(x') \rangle$: observable;

2) $\langle \sigma \rangle$: unobservable; and

3) $\langle h(x) \to h(x') \rangle$: observable.

Let $X_{\mathrm{new}}$ denote the set of all new states $x_{\mathrm{new}}$ introduced in Step 3) above. Then

$$X = \tilde{X} \cup X_{\mathrm{new}}. \tag{6}$$

This completes the model building procedure for diagnosis. The system to be diagnosed is now represented by the discrete-event model

$$G = (X, \Sigma, \delta, x_0). \tag{7}$$
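The three transition-rewriting rules can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' implementation: the tagged-tuple encoding of the composite events (`("cmd", ...)` and `("chg", ...)`), the naming of new states, the function name, and the toy pump model with events `START`, `FAIL`, `BREAK` are all assumptions introduced for the example.

```python
from typing import Dict, Hashable, Set, Tuple

State = Hashable
Event = Hashable


def build_diagnosis_model(delta_tilde: Dict[Tuple[State, str], State],
                          observable: Set[str],
                          h: Dict[State, Hashable]):
    """Apply rules 1) to 3) to every transition of the composite model.

    Returns (delta, observable_events, new_states) of the transformed model.
    """
    delta: Dict[Tuple[State, Event], State] = {}
    observable_events: Set[Event] = set()
    new_states: Set[State] = set()
    for (x, sigma), x_next in delta_tilde.items():
        if sigma in observable:
            # Rule 1: pair the command with the sensor reading that follows it.
            event = ("cmd", sigma, h[x_next])
            delta[(x, event)] = x_next
            observable_events.add(event)
        elif h[x] == h[x_next]:
            # Rule 2: unobservable event with no change in sensor readings.
            delta[(x, sigma)] = x_next
        else:
            # Rule 3: split into the unobservable event followed by an
            # observable change of sensor readings via a new state.
            x_new = ("new", x, sigma)
            new_states.add(x_new)
            delta[(x, sigma)] = x_new
            change = ("chg", h[x], h[x_next])
            delta[(x_new, change)] = x_next
            observable_events.add(change)
    return delta, observable_events, new_states


# Hypothetical toy model of a pump with one pressure sensor.
DELTA_TILDE = {("off", "START"): "on",
               ("off", "FAIL"): "failed_off",
               ("on", "BREAK"): "dead"}
H = {"off": "NP", "on": "PP", "failed_off": "NP", "dead": "NP"}
DELTA, OBS, NEW = build_diagnosis_model(DELTA_TILDE, {"START"}, H)
```

In the toy model, `FAIL` leaves the pressure reading at `NP` and therefore stays a single unobservable transition, whereas `BREAK` changes the reading from `PP` to `NP` and is split into an unobservable transition to a new state followed by an observable sensor change.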

Note that the model $G$ accounts for the normal and failed behavior of the system. The observable events in this system may be one of the following: commands issued by the supervisor and sensor readings immediately after the execution of the above commands, and changes of sensor readings. The unobservable events may be failure events or other events which cause changes in the system state not recorded by sensors.

We note at this point that the proposed approach to diagnosis is not limited to the case of equipment and controller failures. Sensor failures, too, can be handled in this framework by simply treating the sensor as an additional component of the system. In other words, we develop in addition to the equipment and controller models, explicit discrete-event models, which include both normal and failed states, for those sensors that can fail.

We now present two examples to illustrate the above modeling procedure. These examples also illustrate that in the proposed framework, the modeling can be done at different levels of granularity. In the first example, we model the dynamic behavior of a system over its entire range of operation including start-up and shutdown procedures. In the second example, we model deviations from the steady state of a system.

Example 2.1: Consider an elementary HVAC system consisting of a pump, a valve, and a controller. Fig. 3 depicts the individual component models $G_i$, $i = 1, 2, 3$, of the valve, pump, and controller, respectively. The valve has four failure events: STUCK-CLOSED-1, STUCK-CLOSED-2, STUCK-OPEN-1, and STUCK-OPEN-2. The states SC and SO represent the stuck-closed and the stuck-open status of the valve, respectively, while the states VC and VO denote the closed-normal and open-normal status, respectively. Likewise, the pump has four failure events: PUMP-FAILED-OFF-1, PUMP-FAILED-OFF-2, PUMP-FAILED-ON-1, and PUMP-FAILED-ON-2. The states PFOFF and PFON represent the failed-off and failed-on status of the pump while the states PON and POFF represent the normally-on and off status. The only unobservable events in this system are the failure events of the pump and the valve.

TABLE I. The global sensor map h.

h(POFF, VC, •) = NP, NF
h(POFF, VO, •) = NP, NF
h(POFF, SC, •) = NP, NF
h(POFF, SO, •) = NP, NF
h(PON, VO, •) = PP, F
h(PON, SC, •) = PP, NF
h(PON, SO, •) = PP, F
h(PFOFF, VC, •) = NP, NF
h(PFOFF, VO, •) = NP, NF
h(PFOFF, SC, •) = NP, NF
h(PFOFF, SO, •) = NP, NF
h(PFON, VC, •) = PP, NF
h(PFON, VO, •) = PP, F
h(PFON, SC, •) = PP, NF
h(PFON, SO, •) = PP, F

Fig. 4. Synchronous composition of the component models for Example 2.1. [Figure: state-transition diagram over the pump states POFF, PON, PFOFF, PFON and the valve states VC, VO, SC, SO.]

The system $\tilde{G}$ in Fig. 4 is obtained by the synchronous composition of the valve, pump, and controller models of Fig. 3. Both the accessible and the inaccessible states of the system are shown in this figure. The inaccessible states are subsequently dropped. Dotted lines in this figure indicate unobservable events while solid lines indicate observable events. For the sake of clarity, some of the events in this figure are shown abbreviated. For instance, the event STUCK-CLOSED-1 is shown as SC1, the event PUMP-FAILED-ON-2 as PFON2, and so forth.

Next, assume that there are two sensors in the system, a pressure sensor on the pump and a valve flow sensor. Let $Y_1 = \{NP, PP\}$ and $Y_2 = \{NF, F\}$ denote the sets of outputs of the pressure sensor and flow sensor, respectively. NP and PP denote no pressure and positive pressure, respectively, while NF and F denote no flow and flow, respectively. Table I lists the global sensor map $h$. Note that the map $h$ is defined only for the accessible states of $\tilde{G}$ in Fig. 4. Also, $h$ does not depend on the state of $G_3$, the controller, which is indicated in the table by the •s.
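The entries of Table I appear consistent with a pressure reading that depends only on the pump state and a flow reading that requires both a running pump and an open valve. The Python sketch below encodes that reading as two individual sensor maps combined into a global map, in the manner of (5). The factorization and the function names are an interpretation of the table rather than something stated in the text; in particular, the value returned for the pair (PON, VC), which Table I does not list, is outside what the table specifies.

```python
from typing import Dict, Optional, Tuple


def pressure(pump: str) -> str:
    """Pressure sensor (h_1): positive pressure whenever the pump runs."""
    return "PP" if pump in ("PON", "PFON") else "NP"


def flow(pump: str, valve: str) -> str:
    """Flow sensor (h_2): flow needs a running pump and an open valve."""
    pump_running = pump in ("PON", "PFON")
    valve_open = valve in ("VO", "SO")
    return "F" if pump_running and valve_open else "NF"


def h(pump: str, valve: str, controller: Optional[str] = None) -> Tuple[str, str]:
    """Global sensor map; the controller state is ignored, as in Table I."""
    return (pressure(pump), flow(pump, valve))


# Entries of Table I, keyed by (pump state, valve state).
TABLE_I: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("POFF", "VC"): ("NP", "NF"), ("POFF", "VO"): ("NP", "NF"),
    ("POFF", "SC"): ("NP", "NF"), ("POFF", "SO"): ("NP", "NF"),
    ("PON", "VO"): ("PP", "F"), ("PON", "SC"): ("PP", "NF"),
    ("PON", "SO"): ("PP", "F"),
    ("PFOFF", "VC"): ("NP", "NF"), ("PFOFF", "VO"): ("NP", "NF"),
    ("PFOFF", "SC"): ("NP", "NF"), ("PFOFF", "SO"): ("NP", "NF"),
    ("PFON", "VC"): ("PP", "NF"), ("PFON", "VO"): ("PP", "F"),
    ("PFON", "SC"): ("PP", "NF"), ("PFON", "SO"): ("PP", "F"),
}
```

Checking `h` against every entry of `TABLE_I` is a quick way to confirm that the factorized form reproduces the table.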

The final composite model $G$ is given in Fig. 5. The shaded circles in Fig. 5 denote the additional states $x_{\mathrm{new}}$; as before, observable events are indicated by solid lines and unobservable events by dotted lines. The table in Fig. 5 lists the events in
