Maintaining bipartite matchings in the presence of failures

doi:10.1002/NET.3230230503

Maintaining Bipartite Matchings in the Presence of Failures*

Edwin Hsing-Mean Sha
Department of Computer Science & Engineering, University of Notre Dame, Notre Dame, Indiana 46556

Kenneth Steiglitz
Department of Computer Science, Princeton University, Princeton, New Jersey 08544

We present an on-line distributed reconfiguration algorithm for finding a new maximum matching incrementally after some nodes have failed. Our algorithm is deadlock-free and, with k failures, maintains at least $M - k$ matching pairs during the reconfiguration process, where M is the size of the original maximum matching. The algorithm tolerates failures that occur during reconfiguration. The worst-case reconfiguration time is $O(k \min(|A|, |B|))$ after k failures, where A and B are the node sets, but simulations show that the average-case reconfiguration time is much better. The algorithm is also simple enough to be implemented in hardware. © 1993 by John Wiley & Sons, Inc.

1. INTRODUCTION

Imagine that there are n persons in Village A and m in Village B. Two persons from different villages can be matched to become a couple, and at any time, a person can be matched to only one other person. Initially, the matching is maximum. Sometimes, however, people decide to be alone. Without loss of generality, assume that some in B change their minds. Let $G = (A, B, E)$ be a bipartite graph with $|A| = n$ and $|B| = m$. An edge between two nodes means that they are allowed to become a couple. After a person b has changed his or her mind, b's original partner in A must find another available one in B, if possible. The process of finding a new matching to obtain the maximum number of pairs is called reconfiguration. Unfortunately, there is no central agency to perform the reconfiguration process, so this process must be done in a distributed and parallel way. It is also desirable that, during the reconfiguration process, as many matched pairs be maintained as possible and that failures during the process be tolerated. Ideally, there should always be at least $M - k$ matching pairs after k persons have changed their minds, where M is the original number of matching pairs. The number of matching pairs should increase monotonically during the reconfiguration process. Therefore, if no new persons change their minds, the reconfiguration process will finally regenerate a new maximum matching, if one is possible.

*This work was supported in part by NSF Grant MIP-8912100 and U.S. Army Research Office-Durham Grant DAAL03-89-K-0074.

One motivation for this problem is that such an algorithm can be applied to any fault-tolerant system that involves bipartite matching. For example, Kuo and Fuchs [5] showed that many problems of spare allocation in VLSI arrays can be modeled as bipartite matching. Based on our bipartite matching algorithm, we can have a distributed reconfiguration mechanism to replace faulty nodes by spare nodes in a redundant array. In [9, 10], highly reliable structures with the asymptotically optimal number of nodes and edges for one-dimensional and treelike array architectures were given. They used bipartite matchings between levels in layered graphs and so these are particularly well suited for the run-time fault-tolerant algorithm described in this paper.

NETWORKS, Vol. 23 (1993) 459-471. CCC 0028-3045/93/050459-13

The general matching problem has been extensively studied. For maximum matching in bipartite graphs, the algorithm of Hopcroft and Karp [3] is the fastest known, and the algorithm by Micali and Vazirani [6] is the most efficient one for finding matchings in general graphs. More recently, an algorithm for on-line bipartite matching was presented [4]. Some papers [8, 11] also gave distributed algorithms for maximum matching in general graphs.

Our problem is different from the usual matching problem, which starts with an empty matching. We assume that we start with a maximum matching, and after some nodes fail, we would like to have a simple, efficient, and distributed way to find a new maximum matching. Further, the algorithm should start to reconfigure the system as soon as failures occur, even though new failures may occur during the reconfiguration process. We say a reconfiguration algorithm is on-line if it can start to reconfigure the system immediately after a failure occurs and can endure new failures during reconfiguration. This is an especially desirable property for run-time fault tolerance, since the system need not stop to do a reconfiguration process.

We will not be concerned so much with the number of messages that PEs need to send to achieve a new matching, such as is done in the matching algorithms in [8, 11], which, in any event, are not designed to operate in the presence of faults. Rather, we want to minimize the effects of failures during reconfiguration. Our algorithm does tolerate faults during operation and ensures that after k failures there are always at least $M - k$ matching pairs, where M is the original number of matching pairs. If there are no further failures, the size of the matching grows monotonically until it becomes maximum. The algorithm is simple enough to be implemented in hardware. The overall reconfiguration time is $O(k \min(|A|, |B|))$ after k failures. The simulation results show that the average-case reconfiguration time is much better.

2. THE BASIC IDEAS OF OUR ALGORITHM

We first explain our model: An array architecture is represented by a graph G; each node of G is regarded as a processor, and each edge as a connection between two processors. If nodes have failed, the failed nodes and all the edges incident to them will be removed. If later a failed node is repaired, this node with the corresponding edges will be added to the graph. We assume that if two nodes have not failed, and are connected, they can communicate, i.e., we do not model failures of communication.

Definition 2.1. Given a bipartite graph $G = (A, B, E)$, a matching M is a subset of the edges such that no two edges in M share the same end node.

Definition 2.2. If an edge (a, b) is in M, we say that a is b's matching node in M or vice versa. This pair (a, b) is also called a matching pair or a matching edge. If no edge in M is connected to node x, we say x is a free node.

Definition 2.3. A matching is maximum if no other matching of G contains more edges. Given a matching M, an alternating path P is a path that does not contain two consecutive edges that are not in M. If an alternating path P starts and ends at free nodes, it is an augmenting path.

It is well known that M is not a maximum matching if and only if there is an augmenting path. Our algorithm searches for augmenting paths to obtain the maximum matching of G.
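The role of an augmenting path can be illustrated with a small sketch. It is not part of the paper's algorithm; the function name `augment` and the edge representation (a set of `frozenset` pairs) are assumptions made for illustration. Flipping the edges along an augmenting path, that is, taking the symmetric difference of the matching with the path's edges, yields a matching with exactly one more edge.

```python
def augment(matching, path):
    """Flip the edges along an augmenting path (hypothetical helper).

    matching: set of frozenset({u, v}) edges.
    path: list of nodes; the first and last nodes are free, and the edges
          alternate between non-matching and matching edges.
    """
    path_edges = {frozenset(edge) for edge in zip(path, path[1:])}
    # Matching edges on the path are dropped, non-matching ones are added.
    return matching ^ path_edges


# Two matched pairs; a3 and b3 are free.
matching = {frozenset({"a1", "b1"}), frozenset({"a2", "b2"})}
path = ["a3", "b1", "a1", "b2", "a2", "b3"]
new_matching = augment(matching, path)
print(len(matching), len(new_matching))
```

The path has three non-matching edges and two matching edges, so the flip trades two pairs for three.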

After some nodes have failed, the search for augmenting paths to find free nodes will traverse the graph. Basically, our algorithm performs a depth-first search for finding free nodes. In this section, we describe our algorithm informally. A formal description of our algorithm is given in the next section. Let $G = (A, B, E)$ be a bipartite graph. We think of sets A and B as two levels of nodes in a bipartite graph. Initially, we assume that a maximum matching already exists. An initial maximum matching can be obtained from our algorithm in the following way: Initially, every node in A regards its matching node as failed and starts to run the bipartite matching algorithm. We assume that a failure of a matched node can be detected by its current matching node.

Nodes in both A and B can fail. For failures in B (resp., A), nodes in A (resp., B) will search for free nodes. We have two versions of our algorithm: Version A is for failures in B and Version B is for failures in A. These versions are the same except that A and B are interchanged. However, if our algorithm is to be used as a reconfiguration algorithm for the layered fault-tolerant structure in [10], we need only Version A because each layer can be regarded as level A.

Let a be a matched node in A and b be a's matching node in B. If node b fails, Version A of our reconfiguration starts at node a. Node a becomes what we call a supernode because it has the privilege of choosing a good node to be its matching node. If a node in A fails, the matching node of this failed node will become a supernode to initiate Version B of our reconfiguration algorithm. These two versions of our algorithm are performed independently to obtain a maximum matching. In this section, without loss of generality, we only explain Version A. However, we need to show that the failures in A do not affect the correctness of Version A.

Fig. 1. The figure for passing supernodes. (Legend: S, supernode; line, current matching edge.)

Here, we explain the actions of a supernode a. First, supernode a tries to find a free node in B that is connected to a. If such a node is available, it becomes a's matching node. Otherwise, supernode a will try to steal a node that is already matched to another node in A. For example, in Figure 1, after node b fails, a becomes a supernode. Since there is no free node connected to a, a will steal node b' that was matched to a'.

Definition 2.4. If a supernode x chooses a node y that has been matched to x' to be its new matching node, we say that x steals y from x'.

After b' has been stolen by a, node a' will become a supernode because a' does not have a matching node. We can think of this process as the supernode token traversing the path from node a to node a' [Fig. 1(b)]. A root node is a node that initiates a search process for finding a new matching after its matching node becomes faulty. The root node is the first supernode in a search process. There may be several searches going on simultaneously, each having a root node.

Our algorithm does a depth-first search (DFS) for finding augmenting paths [7]. The process of searching can be represented as a search tree called an alternating tree. A typical alternating tree is shown in Figure 2. Each root node is the root of an alternating tree, and at any time, a supernode is associated with the node that is performing DFS in a tree. There will be precisely one supernode in each alternating tree. A new matching is found when a supernode acquires a free node. To prevent cycles in searching, we can simply store a bit in each node b to indicate if it has been reached. We say that this node is marked reached. When a supernode finds a free node, this supernode sends messages to unmark the corresponding nodes, as explained later in this section.

Fig. 2. An example of an alternating tree.

If a supernode at a particular point cannot find an adjacent free node, and finds that all the adjacent nodes are marked reached (either by this tree search or some other), it backtracks immediately. Under backtracking, some supernodes may backtrack to root nodes, and these supernodes remain there in an idle state. Thus, we need a way to reactivate them when some other supernodes find free nodes. After a supernode has found a free node, this supernode sends a message, called UNMARK-BACKTRACK, recursively to unmark all the nodes that have been passed through by a backtracking supernode along an alternating path. For example, in Figure 3 there are two idle supernodes, S1 and S2. After S3 has found a free node, S3 will send the message UNMARK-BACKTRACK to wake up the idle supernodes S1 and S2.
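The stealing and backtracking steps described so far can be sketched as a sequential simulation of a single search. This is only an illustration under simplifying assumptions: there is one failure in B, no concurrent searches, and no UNMARK-BACKTRACK messages; the function name `repair_after_failure` and the dictionaries `adj`, `cur`, `owner`, and `stolen_from` are hypothetical names, not the paper's notation. Each steal or backtrack moves one pair rather than removing it, so the number of matched pairs never drops below $M - 1$ in this sketch.

```python
def repair_after_failure(adj, cur, failed_b):
    """Sequentially simulate one search started by the failure of failed_b.

    adj: dict mapping each node in A to a list of its neighbours in B.
    cur: dict mapping each matched node in A to its partner in B (mutated).
    Returns True if the search finds a free node, False if the token
    ends up idle at the root.
    """
    owner = {b: a for a, b in cur.items()}  # inverse of cur
    token = owner.pop(failed_b)             # root: partner of the failed node
    del cur[token]
    reached = set()                         # "reached" bits on nodes in B
    stolen_from = {}                        # victim -> (thief, stolen node)

    while True:
        good = [b for b in adj[token] if b != failed_b]
        free = [b for b in good if b not in owner]
        if free:                            # supernode acquires a free node
            cur[token] = free[0]
            return True
        unreached = [b for b in good if b not in reached]
        if unreached:                       # steal b; its partner gets the token
            b = unreached[0]
            reached.add(b)
            victim = owner[b]
            del cur[victim]
            cur[token], owner[b] = b, token
            stolen_from[victim] = (token, b)
            token = victim
        elif token in stolen_from:          # backtrack: take the old partner back
            thief, b = stolen_from.pop(token)
            del cur[thief]
            cur[token], owner[b] = b, token
            token = thief
        else:                               # token is back at the root: idle
            return False


# The situation of Figure 1: b fails, a steals b1 from a1, a1 finds free b2.
adj = {"a": ["b", "b1"], "a1": ["b1", "b2"]}
cur = {"a": "b", "a1": "b1"}
found = repair_after_failure(adj, cur, "b")

# A case with no free node: the token backtracks to the root and stays idle.
adj2 = {"a": ["b", "b1"], "a1": ["b1"]}
cur2 = {"a": "b", "a1": "b1"}
found2 = repair_after_failure(adj2, cur2, "b")
```

In the second case the original pair (a1, b1) is restored by the backtracking step, which mirrors the requirement that a backtracking node retains its old matching node.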

Versions A and B of our algorithm are performed alternately. In each version, there are three phases as shown in Figure 4. Every node performs the same phase in the same version. Therefore, we need to synchronize all the nodes to perform the same version and the same phase. One possible implementation is to use common wires connected to every node. Because we consider our algorithm to be performed in tightly coupled processor arrays, a few wires connected to every node (PE) is a practical assumption. We can assume there are three signal wires connected to every node (PE). Wire $w_{CLOCK}$ is the clock wire to synchronize the phases of a clock. Wires $w_A$ and $w_B$ indicate which version is running. When $w_A$ (resp., $w_B$) is high, Version A (resp., B) is running. If we do not want to use these common wires, we can use a more complicated message-passing protocol for synchronization [1].

Fig. 3. An example of breaking idleness.

Fig. 4. A running sequence of our algorithm; each version has three phases.
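The running sequence of Figure 4 can be written down as a simple global schedule. The sketch below is a software stand-in for the shared wires, not a hardware description; the generator name `schedule` and the dictionary keys `wA` and `wB` are assumptions chosen to mirror the wire names in the text.

```python
from itertools import cycle, islice


def schedule():
    """Yield (version, phase, wires) steps forever: A1, A2, A3, B1, B2, B3, ..."""
    for version in cycle("AB"):
        for phase in (1, 2, 3):
            # Exactly one of the two version wires is high at any step.
            wires = {"wA": version == "A", "wB": version == "B"}
            yield version, phase, wires


steps = list(islice(schedule(), 7))
for version, phase, wires in steps:
    print(version, phase, wires)
```

Every node reading the same step performs the same phase of the same version, which is the synchronization requirement stated above.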

3. OUR RECONFIGURATION ALGORITHM

In this section, we explain our algorithm. Since Versions A and B are essentially the same, we only present Version A in this section. First, we define some terms for Version A of our algorithm:

Definition 3.1. The node old(n) is n's original matching node before the reconfiguration, and the node cur(n) is n's current matching node during reconfiguration. Initially, for every node n, we set cur(n) = old(n).

In our algorithm, there are several attributes for nodes in A and B, which are used and set during the operation of the algorithm. First, any node is good if it has not malfunctioned. The attributes of a node b in B are summarized as follows: A node $b \in B$ is

    free if it has no matching node under the current matching,
    reached if it has been reached by some DFS in our algorithm.

When a node is not reached, we say that this node is unreached.

The attributes of a node $a \in A$ that is reached by some search process can be marked by message passing as follows: Node $a \in A$ is

    super if cur(a) is not good, or it is unmatched because its matching node cur(a) has been stolen by some other node;
    backtracked if a search that reaches node a finishes searching node a's subtree and must backtrack to a's parent.

We call a node super if and only if it has a supernode token. This token can be transferred to other nodes along the DFS traversed in our algorithm. Messages need to be passed in our algorithm for changing the current states of nodes $a \in A$. There are three messages that can be sent: SUPERNODE, UNMARK-BACKTRACK, and CHANGE-OLD-MATCHING. We discuss these three messages one by one as follows:

1. The message SUPERNODE represents the supernode token. If node a receives the message SUPERNODE, a becomes the supernode. There are two situations when a node a sends this message. The first situation is when node a steals some other's matching node. The second situation will be explained later in the section (see Fig. 5).

2. After a supernode s has found a free and good node in B, s will send the message UNMARK-BACKTRACK to all the backtracked nodes that are adjacent to node old(s). This message is used to set some nodes in B as not reached so that some idle supernodes can start to search for free nodes. When a node $a \in A$ receives the message UNMARK-BACKTRACK, a will set node old(a) as unreached. Then, after a sends UNMARK-BACKTRACK to old(a), old(a) will immediately send this message to all the backtracked nodes that are adjacent to old(a).

3. When a supernode finds a free and good node in B, this supernode will send the message CHANGE-OLD-MATCHING to the nodes in the alternating path so that their old matching nodes are set to be the current matching nodes. When a node a gets the message CHANGE-OLD-MATCHING, node a will mark the node old(a) as unreached and ask old(a) to send UNMARK-BACKTRACK to all the backtracked nodes that are adjacent to old(a).

Fig. 5. A failure in A that is in an active alternating path.
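The recursive spread of UNMARK-BACKTRACK can be sketched as a breadth-first propagation. This is an illustration only: the function name `unmark_backtrack`, the adjacency map `adj_b` (from nodes in B to their neighbours in A), and the `notified` set are assumptions. In particular, the `notified` set is added here to guarantee termination; the text above does not say how a node avoids handling the message twice.

```python
from collections import deque


def unmark_backtrack(start_b, adj_b, old, backtracked, reached):
    """Propagate UNMARK-BACKTRACK starting from node start_b in B.

    adj_b: dict mapping each node in B to a list of its neighbours in A.
    old: dict mapping each node in A to its old matching node in B.
    backtracked: set of nodes in A that carry the backtracked attribute.
    reached: set of nodes in B currently marked reached (mutated).
    Returns the set of nodes in A that received the message.
    """
    notified = set()  # assumption: each node handles the message once
    queue = deque([start_b])
    while queue:
        b = queue.popleft()
        for a in adj_b[b]:
            if a in backtracked and a not in notified:
                notified.add(a)
                reached.discard(old[a])  # a sets old(a) as unreached
                queue.append(old[a])     # old(a) forwards the message
    return notified


adj_b = {"b0": ["a1"], "b1": ["a1", "a2"], "b2": ["a2"]}
old = {"a1": "b1", "a2": "b2"}
reached = {"b1", "b2"}
notified = unmark_backtrack("b0", adj_b, old, {"a1", "a2"}, reached)
```

Here `b0` plays the role of old(s) for the supernode s that found a free node; after the call, both `b1` and `b2` are unreached again, so an idle supernode adjacent to them could resume its search.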

Our algorithm runs in parallel at all the nodes. Initially, there is a bipartite maximum matching. In Phase 1, each node checks if it needs to initiate a searching process because of the failure of its current matching node. The real search process is performed in Phase 2. If the supernode a is successful in finding a free and good node, a sends messages CHANGE-OLD-MATCHING and UNMARK-BACKTRACK as we explained previously. If node a cannot find a free node, node a will try to steal others' matching nodes. The supernode a will steal an unreached and good node b, and send the message SUPERNODE to node cur(b). Otherwise, if all a's adjacent nodes have been marked reached and the node old(a) is good, a will backtrack. Node a will retain its old matching node and send SUPERNODE to node cur(old(a)). Otherwise, if the supernode token has backtracked to a root node, this supernode token will wait there.

In Phase 3, node a will do the appropriate operations depending on which message a has received. If there are failures in A, their corresponding old matching nodes become supernodes. We will explain the details later. In Version A, a supernode in A should not steal any supernode in B, since these supernodes in B will start their searches later in Version B. Denote by N the node that is performing the following algorithm. The following is a sketch of Version A of our algorithm that runs at all the nodes in A in parallel. A more detailed algorithm is presented in the Appendix.

/* Let set E be the set of nodes in B which are good, adjacent to N, and not supernodes. */

Phase 1
    If cur(N) is not good, N is a supernode.

Phase 2
    If N is a supernode
        If there exists a free node in E
            Set old(N) to be not reached
            Ask old(N) to send CHANGE-OLD-MATCHING to cur(old(N))
            Ask old(N) to send UNMARK-BACKTRACK to all adjacent backtracked nodes
        Else if N can steal an unreached node b in E
            Ask b to send SUPERNODE to cur(b)
        Else if old(N) is good
            backtrack from N
        Else
            Do nothing

Phase 3
    If N receives SUPERNODE
        Set N to be a supernode.
    If N receives CHANGE-OLD-MATCHING
        Set old(N) to be not reached
        Ask old(N) to send CHANGE-OLD-MATCHING to cur(old(N))
        Ask old(N) to send UNMARK-BACKTRACK to all adjacent backtracked nodes
    If N receives UNMARK-BACKTRACK
        Set old(N) to be not reached
        Ask old(N) to send UNMARK-BACKTRACK to all adjacent backtracked nodes
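The Phase 2 case analysis in the sketch above can be expressed as a pure decision function. This is a hedged illustration rather than the paper's implementation: the function name `phase2_action`, its parameters, and the action labels are hypothetical, and the function only reports which branch applies without sending any messages.

```python
def phase2_action(n, adj, good, free, reached, super_b, old, supernodes):
    """Return the Phase 2 branch taken by node n in A as (label, target).

    adj: dict mapping nodes in A to lists of neighbours in B.
    good, free, reached, super_b: sets of nodes in B with those attributes.
    old: dict mapping nodes in A to their old matching node.
    supernodes: set of nodes in A that hold a supernode token.
    """
    if n not in supernodes:
        return ("none", None)
    # E: neighbours of n in B that are good and are not supernodes themselves.
    e = [b for b in adj[n] if b in good and b not in super_b]
    for b in e:
        if b in free:
            return ("take-free", b)
    for b in e:
        if b not in reached:
            return ("steal", b)
    if old.get(n) in good:
        return ("backtrack", old[n])
    return ("idle", None)  # the "Do nothing" branch


adj = {"a": ["b1", "b2"]}
good = {"b1", "b2"}
action = phase2_action("a", adj, good, set(), {"b1"}, set(), {"a": "b0"}, {"a"})
print(action)
```

In this example `b0` has failed (it is not in `good`), no neighbour is free, and only `b2` is unreached, so the stealing branch applies.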

We now discuss the operations that nodes in B perform in Version A. We need to define the following terms:

Definition 3.2. A supernode a is called idle if node a is a supernode and every adjacent node of a is labeled reached, and old(a) is not good; otherwise, a supernode is called active. We say an alternating path is active if the corresponding supernode is active.

In Version A, nodes in B basically perform the message passing for nodes in A. However, when there are failures in A, the old matching nodes of these failed nodes become supernodes. These supernodes in B do not perform any search while the algorithm is running Version A, but they need to do some operations for nodes in A. There are two cases for failure of a node a in A: Either a is not in an active alternating path or a is. Let b be the old matching node of a. If a is not in an active alternating path, b will do nothing except become a supernode for Version B. If a is in an active alternating path as Figure 5(a) shows, b becomes a supernode and initiates a backtracking to cur(b) [b sends SUPERNODE to cur(b)]. This backtracking is to restore the alternating path. We can regard this original


References

An $n^{5/2}$ Algorithm for Maximum Matchings in Bipartite Graphs

An $O(\sqrt{|V|} \cdot |E|)$ algorithm for finding maximum matching in general graphs

An optimal algorithm for on-line bipartite matching

Complexity of network synchronization

Configuration of VLSI arrays in the presence of defects
