A block algorithm for the algebraic path problem and its execution on a systolic array

F.J. Nunez and M. Valero
- pp 265-274

A BLOCK ALGORITHM FOR THE ALGEBRAIC PATH PROBLEM AND ITS EXECUTION ON A SYSTOLIC ARRAY

Fernando J. Nunez and Mateo Valero
Departamento de Arquitectura de Computadores
Facultad de Informática de Barcelona, Universidad Politécnica de Cataluña
Pau Gargallo 5, 08028 Barcelona, SPAIN

This work was supported by the Ministry of Education and Science of Spain (CAICYT) under contract PA85-0314.
ABSTRACT

The solution of the Algebraic Path Problem (APP) for arbitrarily sized graphs by a fixed-size systolic array processor (SAP) is addressed. The APP is decomposed into two subproblems, and a SAP is designed for each one. Both SAPs combined produce a highly implementable, versatile SAP (VSAP). The proposed VSAP has p×p processing elements (PEs), solving the APP of an N-vertex graph in N^3/p^2 + N^2/p + 3p - 2 cycles. With slight modifications in the operations performed by the PEs, the problem is optimally solved in N^3/p^2 + 3p - 2 cycles.
1. INTRODUCTION

Important problems such as the Transitive Closure (TC), the All-Pairs Shortest Path (SP), and the Gauss-Jordan Elimination are particular cases of the more general APP. Its solution is a computing-intensive task: it has cubic complexity with respect to the size of the problem.

Using (N+1)^2 PEs in a hexagonally connected array, the APP is solved in 7N cycles in [1]. An array that solves the APP in 5N cycles with N×(N+1) PEs can be found in [2]. In [3], a dependence-graph-based design method obtains new and already proposed arrays for the APP. There, optimum arrays obtaining the result in 5N cycles with N×N PEs are proposed. In fact, many SAPs devoted to special cases of the APP can be generalized for solving it, for example, SAPs for the TC and the SP [4], [5], [6], [7].

There is always a computation too large for a given array. Mapping larger-sized computations into smaller arrays is of great practical interest. The original problem is decomposed into subproblems whose sizes fit the available SAP size. This paper is devoted to decomposing the APP and to designing an array that solves the subproblems and combines their partial results. APP instances like the TC have been partitioned [8], [9], but to our knowledge there are no works about the partitioning of the APP in the literature.
In the next section the basics of the APP are briefly reviewed. Section 3 describes an iterative block algorithm for the APP. Then, in section 4, this algorithm is rearranged to solve the APP using only two matrix operations, i.e., the APP is decomposed into two subproblems. The design of a SAP for each subproblem, and their combination to form the resulting VSAP, is commented on in section 5. Block I/O details, changes for achieving an optimal execution, and performance expressions are given in section 6. Finally, in section 7 the outstanding points addressed in the paper are discussed.
2. THE ALGEBRAIC PATH PROBLEM

Consider a weighted directed graph G = <V,E,w>, where V is its N-vertex set, E ⊆ V×V its edge set, and w: E→S an edge weighting function whose values are taken from the set S. S belongs to a path algebra <S,+,×,*,0,1> together with two binary operations, "addition" +: S×S→S, and "multiplication" ×: S×S→S, and a unary operation called closure ()*: S→S. Constants 0 and 1 belong to S. Hereby, + and × will be used in infix notation, whereas the closure of an element a will be noted as a*. Additionally, * will have precedence over ×, and this over +.
This algebra is a closed semiring, i.e., an algebra fulfilling the following axioms [10]: + is associative and commutative, with 0 as neutral; × is associative, with 1 as neutral, and distributes over +; element 0 is absorptive with respect to ×; and the equality a* = 1 + a × a* = 1 + a* × a must hold.

CH2603-9/88/0000/0265 $01.00 © 1988 IEEE
The weight w(p) of a path p = (e1, e2, ..., em), ei ∈ E, is defined as

w(p) := w(e1) × w(e2) × ... × w(em).

The APP is the determination of the sum of the weights of all the possible paths between each pair of vertices (i,j). If P(i,j) is the set of all the possible paths from i to j, the APP is to find the values [11]:

d(i,j) := Σ_{p ∈ P(i,j)} w(p)

International Conference on Systolic Arrays
We associate a weight matrix A = {a(i,j)}, 1 ≤ i,j ≤ N, with graph G, where a(i,j) := w((i,j)) if (i,j) ∈ E, and a(i,j) := 0 otherwise. This matrix representation permits us to formulate the APP as the computation of a sequence of matrices A(k) = {a(i,j)(k)}, 0 ≤ k ≤ N. The value a(i,j)(k) is the weight of all the possible paths from vertex i to j with intermediate vertices v, 1 ≤ v ≤ k. Initially, A(0) := A; then A* := A(N), where A* := {a(i,j)*} is an N×N matrix satisfying a(i,j)* = d(i,j) if P(i,j) is not empty, and a(i,j)* := 0 otherwise.
The algorithms proposed to solve path problems formulated in matrix terms are known as matrix methods; most are referenced in [11]. Among them we find algorithms for the TC [12], [13]; the SP [14]; and the classic Gauss and Gauss-Jordan eliminations to compute the inverse of a matrix.

It was observed that the same program schemes could solve these problems. A program scheme is a program with fixed control, but where the sets over which the variables take their values, and the meaning of the algebraic operations, are left uninterpreted [10]. The majority of the program schemes useful for the APP come from Linear Algebra, like the Jacobi and Gauss-Seidel methods, and the mentioned Gauss and Jordan eliminations [15]. The Gauss-Jordan elimination has been recently introduced for solving the APP [1]. The array described in [2] is based on it.
There are many interesting problems that are specializations of the APP [15]. If the algebra is boolean, S = {0,1}, "+" = OR, "×" = AND, 0* = 1, 1* = 1, "0" = 0, "1" = 1; the APP finds the TC. For the SP: S = [0,+∞) ∪ {+∞}, "+" = min, "×" = +, "0" = +∞, "1" = 0; closure is the constant operation 0. For matrix inversion: S = R, "+" and "×" are the ordinary addition and multiplication with respective neutrals 0 and 1. The closure is a* = 1/(1-a) for a ≠ 1.
3. APP BLOCK DECOMPOSITION

Block algorithms are a useful tool for extracting parallelism from problems. For example, when solving matrix problems, both input and output matrices can be split into blocks or submatrices. Different block subproblems can be allocated to different processors. Processors must combine their partial results in order to attain the desired global solution. This can be named interblock parallelism.

Block algorithms are also used for solving problems in parallel using systolic arrays. However, the approach is completely different. In this case, blocks must fit the size of the available array. Subproblems are solved one after the other, with a proper sequencing to combine their partial solutions efficiently. By doing so, the original problem is solved. In the context of systolic computing, these are known as size-independent algorithms, and problem decomposition is named partitioning [16], [17].
Systolic arrays exploit intrablock parallelism. An array can be seen as a single processor running a sequential algorithm, but operating on blocks instead of elements. Normally, problem partitioning generates subproblems of different nature. For example, the LU-decomposition is split into smaller matrix products with accumulation, linear systems of equations, and LU-decompositions [18]; the TC requires solving smaller TCs and performing boolean matrix multiplications with accumulation [9]. We are interested in solving all the resulting subproblems using one single array.

Suppose that, in the process of solving the APP by obtaining the matrix sequence A(0), ..., A(N), matrix A(k) has already been computed. Also assume we know how to solve the APP for matrices with p×p elements or smaller. The point is how to obtain A(k+p) through block operations. The involved blocks are shown in figure 1. Block A1 is a p×p block on the diagonal. Note that A2 and A3 have p×(N-p) and (N-p)×p non-zero elements, respectively, while A4 has (N-p)×(N-p). Then, Theorem 1 indicates how to obtain A(k+p) from A(k) through block operations.
Operator ()* represents the APP of a block; + and × are the natural extensions to matrices of the addition and multiplication of the underlying algebra. Figure 2 illustrates a block operation.

Figure 1. The blocks referenced in Theorem 1.
Figure 2. A block operation illustrated.
Theorem 1: Under the preceding assumptions, A(k+p) can be obtained from A(k) as follows:

A1(k+p) := A1(k)*                                  (3.a)
A2(k+p) := A1(k)* × A2(k)                          (3.b)
A3(k+p) := A3(k) × A1(k)*                          (3.c)
A4(k+p) := A4(k) + A3(k) × A1(k)* × A2(k)          (3.d)
Proof: A k-path in G is defined as a path whose intermediate vertices must belong to the set {1..k}. With each matrix A(k) we can associate a graph G(k) = <V,E(k),w(k)> whose edge weights are the sums of all the possible k-paths in G between each ordered pair of vertices. A (k+p)-path in G is either a k-path (an edge in G(k)) or a path in G(k) whose intermediate vertices are taken from the set {k+1..k+p}. An example of the latter path is depicted in figure 3-a. If i,j ∈ {k+1..k+p}, then (3.a) follows trivially. Consider now (3.d); then i,j ∈ {1..N}-{k+1..k+p}. The general form of this path is represented in figure 3-b. Note that a(v1,vm)(k+p), an element of A1(k)*, is the weight of all the (k+p)-paths from v1 to vm. Thus, the weight of this path from i to j is a(i,v1)(k) × a(v1,vm)(k+p) × a(vm,j)(k). When the vertices range over the specified sets, the overall operations can be expressed in matrix terms as (3.d). The term A4(k) adds the contributions of the k-paths. Cases (3.b) and (3.c) are simplifications of (3.d). □
It is worth mentioning that block schemes for the APP can also be derived by extending to blocks the properties fulfilled by the elements, and then applying any valid program scheme to blocks. Nevertheless, we have decided to present Theorem 1 because it is more general: it permits designing parallel algorithms using different block sizes for the same problem. In addition, by employing graph instead of algebraic terms, it provides insight into the effect of the underlying operations.
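As an illustration of Theorem 1 (ours, not part of the paper), the following sketch applies equations (3.a)-(3.d) to one diagonal block of a min-plus (shortest path) weight matrix; the helper names mp_mul, mp_add and mp_closure are our own:

```python
# Illustrative sketch of one Theorem 1 step in the min-plus algebra.
INF = float("inf")

def mp_mul(X, Y):
    """Matrix "multiplication": (X x Y)[i][j] = min_t X[i][t] + Y[t][j]."""
    return [[min(X[i][t] + Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def mp_add(X, Y):
    """Matrix "addition": elementwise min."""
    return [[min(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(X, Y)]

def mp_closure(X):
    """APP of a block: Floyd-Warshall with a 0 ("1") diagonal."""
    n = len(X)
    C = [[0.0 if i == j else X[i][j] for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                C[i][j] = min(C[i][j], C[i][k] + C[k][j])
    return C

def theorem1_step(A, k, p):
    """Obtain A(k+p) from A(k); A1 is the p x p diagonal block at offset k."""
    n = len(A)
    d = list(range(k, k + p))                # indices of A1
    r = [i for i in range(n) if i not in d]  # the remaining indices
    A1 = [[A[i][j] for j in d] for i in d]
    A2 = [[A[i][j] for j in r] for i in d]
    A3 = [[A[i][j] for j in d] for i in r]
    A4 = [[A[i][j] for j in r] for i in r]
    A1s = mp_closure(A1)                               # (3.a)
    A2n = mp_mul(A1s, A2)                              # (3.b)
    A3n = mp_mul(A3, A1s)                              # (3.c)
    A4n = mp_add(A4, mp_mul(mp_mul(A3, A1s), A2))      # (3.d)
    B = [[None] * n for _ in range(n)]
    for a, i in enumerate(d):
        for b, j in enumerate(d): B[i][j] = A1s[a][b]
        for b, j in enumerate(r): B[i][j] = A2n[a][b]
    for a, i in enumerate(r):
        for b, j in enumerate(d): B[i][j] = A3n[a][b]
        for b, j in enumerate(r): B[i][j] = A4n[a][b]
    return B
```

Applying the step for k = 0, p, 2p, ..., N-p reproduces A*, which is how the sketch can be checked against a plain Floyd-Warshall closure of the whole matrix.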
4. TWO-PRIMITIVE APP PARTITIONING

As said, all subproblems must be solved by a single array. In this paper we are concerned with 2-D SAPs, although the results also apply to one-dimensional machines. For the moment, let us advance that the fixed-size array developed in the next section has p×p processing elements (PEs).

Figure 3. (a) A (k+p)-path in G(k) with intermediate vertices in {k+1..k+p}; (b) a general form of these paths: a(i,v1)(k) × a(v1,vm)(k+p) × a(vm,j)(k), with i,j ∈ {1..N}-{k+1..k+p} and v1,...,vm ∈ {k+1..k+p}.
Hence, it is natural to partition the N×N matrices to be computed into square blocks with p×p elements. Without loss of generality, assume p divides N evenly. The remainder of this section will deal with blocks, or groups of them. Some matrix notation conventions are mandatory.

Suppose an N×N matrix A that is split into N/p × N/p blocks of size p×p. A(i,j) is one of these blocks, placed at the i-th block-row and j-th block-column. The i-th block-row and the j-th block-column are respectively denoted A(i,&) and A(&,j). A(i,&-j) refers to the i-th block-row without block A(i,j). Under this convention, A(&-i,&-j) is obtained from A by eliminating its i-th block-row and j-th block-column. Subranges also need to be specified. For instance, A(i,j..k) is composed of the blocks in the i-th block-row from block-column j to block-column k.
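As an aside (ours, not the paper's), the indexing conventions can be mimicked with two small helpers; block indices are 1-based as in the text:

```python
# Illustrative helpers for the block-indexing conventions: an N x N
# matrix split into N/p x N/p blocks of p x p elements (1-based indices).
def block(A, p, i, j):
    """A(i,j): the block at block-row i, block-column j."""
    r, c = (i - 1) * p, (j - 1) * p
    return [row[c:c + p] for row in A[r:r + p]]

def block_row(A, p, i, minus=None):
    """A(i,&): the i-th block-row; with minus=j it is A(i,&-j)."""
    nb = len(A) // p
    cols = [j for j in range(1, nb + 1) if j != minus]
    out = []
    for r in range(p):
        out.append([x for j in cols for x in block(A, p, i, j)[r]])
    return out
```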
Now we will present Algorithm 1, which computes the APP of an N×N matrix from the basic assumption that it is known how to obtain the APP of a p×p block. Algorithm 1 is a direct outcome of Theorem 1. It is row-oriented. A column-oriented version could be obtained by simply interchanging block indexes and inverting the order in which blocks are multiplied.
Algorithm 1
B(0) := A
for k := 1 to N/p
    B(k,k)(k) := B(k,k)(k-1)*                                      (4.a)
    B(k,&-k)(k) := B(k,k)(k) × B(k,&-k)(k-1)                       (4.b)
    B(&-k,&)(k) := B(&-k,&-k)(k-1) + B(&-k,k)(k-1) × B(k,&)(k)     (4.c)
end for
A* := B(N/p)

Note that A(kp) = B(k); thus, after N/p iterations, A* = A(N) = B(N/p) results.
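For concreteness, here is an illustrative rendering of Algorithm 1 in the min-plus algebra (our sketch, not the paper's notation; block B(i,j) is stored as B[(i, j)]). Since the min-plus "+" is idempotent, the update (4.c) can be applied uniformly to every block-column, including column k, where it reduces to (3.c):

```python
# Illustrative min-plus rendering of Algorithm 1; all helper names ours.
INF = float("inf")

def mp_mul(X, Y):
    """Min-plus block product: (X x Y)[i][j] = min_t X[i][t] + Y[t][j]."""
    return [[min(X[i][t] + Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def mp_add(X, Y):
    """Min-plus block addition: elementwise min."""
    return [[min(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(X, Y)]

def mp_star(X):
    """The APP of one p x p block: Floyd-Warshall with a 0 ("1") diagonal."""
    p = len(X)
    C = [[0.0 if i == j else X[i][j] for j in range(p)] for i in range(p)]
    for k in range(p):
        for i in range(p):
            for j in range(p):
                C[i][j] = min(C[i][j], C[i][k] + C[k][j])
    return C

def app_block(A, p):
    """Algorithm 1: A* of an N x N min-plus matrix (p divides N)."""
    nb = len(A) // p
    B = {(i, j): [row[(j - 1) * p:j * p] for row in A[(i - 1) * p:i * p]]
         for i in range(1, nb + 1) for j in range(1, nb + 1)}
    for k in range(1, nb + 1):
        B[k, k] = mp_star(B[k, k])                          # (4.a)
        for j in range(1, nb + 1):
            if j != k:
                B[k, j] = mp_mul(B[k, k], B[k, j])          # (4.b)
        for i in range(1, nb + 1):
            if i == k:
                continue
            old = B[i, k]                                   # B(i,k)(k-1)
            for j in range(1, nb + 1):
                # (4.c); for j = k the idempotent "+" makes the extra
                # old term harmless, so the result matches (3.c)
                B[i, j] = mp_add(B[i, j], mp_mul(old, B[k, j]))
    return [[B[i, j][r][c] for j in range(1, nb + 1) for c in range(p)]
            for i in range(1, nb + 1) for r in range(p)]
```

The result can be checked against a plain Floyd-Warshall closure of the whole matrix, since both compute A* in this algebra.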
At this point, we have seen that the APP is decomposable into smaller APPs and matrix multiplications (with or without accumulation), observing the rules of the underlying algebra. The next stage is to modify Algorithm 1 to make it more suitable for systolization. It is interesting to minimize the number of primitives executable by the VSAP. Only two primitives, named P1 and P2, suffice for the APP. Let us define them. Consider three matrices X, Y, Z with sizes p×p, p×m, and p×m respectively. We define P1 and P2 as: P1(X,Y) := X* × Y; and P2(X,Y,Z) := X × Y + Z.
Algorithm 2 solves the APP using only P1 and P2.

Algorithm 2
B(0) := A
for k := 1 to N/p
    B(k,&)(k) := P1(B(k,k)(k-1), B(k,1..k-1)(k-1) | Ip | B(k,k+1..N)(k-1))    (4.d)
    for i := 1 to N/p with i ≠ k
        B(i,&)(k) := P2(B(i,k)(k-1), B(k,&)(k), B(i,&-k)(k-1))                (4.e)
    end for
end for
A* := B(N/p)

Lines (4.a) and (4.b) obtain B(k,&)(k). Using P1, they can be grouped into line (4.d), where the symbol | denotes matrix juxtaposition, and Ip is the p×p identity matrix. On the other hand, line (4.c) is equivalent to B(i,&)(k) := B(i,k)(k-1) × B(k,&)(k) + B(i,&-k)(k-1) for i ≠ k, allowing the use of P2 for computing B(i,&)(k) in line (4.e).
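The two-primitive formulation can likewise be sketched in the min-plus algebra (our illustration, not the paper's implementation): the juxtaposition | becomes row-wise list concatenation, and Ip is the min-plus identity block (0 diagonal, +inf elsewhere):

```python
# Illustrative min-plus rendering of Algorithm 2: every block update
# goes through P1 or P2 only. All helper names are ours.
INF = float("inf")

def mp_mul(X, Y):
    return [[min(X[i][t] + Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def mp_star(X):
    """Closure of a p x p block (Floyd-Warshall, 0 diagonal)."""
    p = len(X)
    C = [[0.0 if i == j else X[i][j] for j in range(p)] for i in range(p)]
    for k in range(p):
        for i in range(p):
            for j in range(p):
                C[i][j] = min(C[i][j], C[i][k] + C[k][j])
    return C

def P1(X, Y):
    """P1(X, Y) = X* x Y."""
    return mp_mul(mp_star(X), Y)

def P2(X, Y, Z):
    """P2(X, Y, Z) = X x Y + Z ("+" is elementwise min)."""
    XY = mp_mul(X, Y)
    return [[min(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(XY, Z)]

def hstack(blocks):
    """Juxtapose p x p blocks side by side (the paper's "|")."""
    return [sum((blk[r] for blk in blocks), []) for r in range(len(blocks[0]))]

def app_two_primitives(A, p):
    """Algorithm 2: A* computed using only P1 and P2."""
    nb = len(A) // p
    B = {(i, j): [row[(j - 1) * p:j * p] for row in A[(i - 1) * p:i * p]]
         for i in range(1, nb + 1) for j in range(1, nb + 1)}
    Ip = [[0.0 if i == j else INF for j in range(p)] for i in range(p)]
    for k in range(1, nb + 1):
        # (4.d): B(k,&)(k) = P1(B(k,k), B(k,1..k-1) | Ip | B(k,k+1..N))
        Y = hstack([B[k, j] for j in range(1, k)] + [Ip]
                   + [B[k, j] for j in range(k + 1, nb + 1)])
        R = P1(B[k, k], Y)
        cols = list(range(1, k)) + [k] + list(range(k + 1, nb + 1))
        for idx, j in enumerate(cols):
            B[k, j] = [row[idx * p:(idx + 1) * p] for row in R]
        # (4.e): B(i,&)(k) = P2(B(i,k)(k-1), B(k,&)(k), B(i,&-k)(k-1))
        for i in range(1, nb + 1):
            if i == k:
                continue
            old = B[i, k]
            for j in range(1, nb + 1):
                # for j = k, min-plus idempotence lets the old block
                # stand in for Z without changing the result
                B[i, j] = P2(old, B[k, j], B[i, j] if j != k else old)
    return [[B[i, j][r][c] for j in range(1, nb + 1) for c in range(p)]
            for i in range(1, nb + 1) for r in range(p)]
```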
5. A VERSATILE SAP FOR THE APP

When designing versatile SAPs, the additional complexity caused by partitioning must be minimized in order to attain an economic and implementable machine. More specifically, it is mandatory to reduce the overhead caused by partitioning in the external hardware and communications, as well as in the required control. The additional delays and the PE complexity have to be minimized too [19].

Intuitively, the VSAP can be obtained by overlapping, in some sense, two SAPs: one for P1 and the other for P2. Moreover, both candidate SAPs must be as similar as possible. An efficient sequencing between consecutive subproblems also has a direct influence on performance. Most of these points are taken into account in papers dealing with the mapping of partitioned systolic algorithms onto fixed-size arrays.

In order to attain a highly implementable design, additional constraints could be observed. For example, by forcing equal input and output data formats, there is no need for a data rearranging network between the VSAP and the external memory modules. In other words, a single storage scheme can be used for all the matrices.

Another interesting restriction is to have memory modules only along one side of the SAP's polygon. This avoids having different algorithms to perform the same operation over matrices entering the array from different points.

Algorithm 2 has been designed to use the same block in a (maybe long) sequence of operations. Hence, in order to be used, this block is preloaded in the VSAP. This helps to reduce the internal and the external I/O bandwidth requirements. All these constraints have influenced the design of the VSAP.
The Jordan algorithm for the quasi-inversion of a matrix can be extended to compute P1 [15]. The following equation must hold: A* = A × A* + Ip; then, by postmultiplying both sides by B, we can write: P1(A,B) = A × P1(A,B) + B. In the latter equation P1(A,B) is the unknown. The Jordan algorithm finds it by computing a sequence of matrices A(k), B(k), and M(k), 1 ≤ k ≤ p, with respective dimensions p×p, p×m, and p×p. Initially A(0) := {a(i,j)(0)} = A, and B(0) := {b(i,j)(0)} = B; then

A(k) := M(k) × A(k-1);  B(k) := M(k) × B(k-1);  1 ≤ k ≤ p.

M(k) is obtained from the k-th column of A(k-1). Figure 4 shows its structure. It can be shown that B(p) = A* × B = P1(A,B).

Figure 4. The structure of matrix M(k).
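The recurrence can be sketched in the min-plus algebra (our illustration, not the paper's code). We assume, as the natural reading of "obtained from the k-th column of A(k-1)", that M(k) equals the identity except in its k-th column, where M(k)[k][k] = a(k,k)(k-1)* and M(k)[i][k] = a(i,k)(k-1) × a(k,k)(k-1)* for i ≠ k:

```python
# Illustrative min-plus sketch of the Jordan computation of P1(A, B).
# Assumption (ours): M(k) is the identity except in column k, built
# from the k-th column of A(k-1); in min-plus, a(k,k)* = 0 = "1".
INF = float("inf")

def mp_mul(X, Y):
    """Min-plus product: (X x Y)[i][j] = min_t X[i][t] + Y[t][j]."""
    return [[min(X[i][t] + Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def jordan_P1(A, B):
    """Compute P1(A, B) = A* x B via A(k) = M(k) x A(k-1), B(k) = M(k) x B(k-1)."""
    p = len(A)
    Ak = [row[:] for row in A]
    Bk = [row[:] for row in B]
    for k in range(p):
        # M(k): identity (0 diagonal, +inf elsewhere) except column k.
        M = [[0.0 if i == j else INF for j in range(p)] for i in range(p)]
        star = 0.0                      # a(k,k)* is the constant 0 in min-plus
        M[k][k] = star
        for i in range(p):
            if i != k:
                M[i][k] = Ak[i][k] + star   # a(i,k)(k-1) x a(k,k)*
        Ak, Bk = mp_mul(M, Ak), mp_mul(M, Bk)
    return Bk                           # B(p) = A* x B = P1(A, B)
```

A quick way to check the sketch is to compare its output against a Floyd-Warshall closure of A multiplied by B.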

References

- Robert W. Floyd, "Algorithm 97: Shortest Path," Communications of the ACM, June 1962.
- Stephen Warshall, "A Theorem on Boolean Matrices," Journal of the ACM, 1962.
- Sun-Yuan Kung, "VLSI Array Processors," 1985.
- Michel Gondran and Michel Minoux, Graphs and Algorithms (book).