A block algorithm for the algebraic path problem and its execution on a systolic array

F.J. Nunez and M. Valero
- pp 265-274

A BLOCK ALGORITHM FOR THE ALGEBRAIC PATH PROBLEM AND ITS EXECUTION ON A SYSTOLIC ARRAY

Fernando J. Nunez and Mateo Valero
Departamento de Arquitectura de Computadores
Facultad de Informática de Barcelona, Universidad Politécnica de Cataluña
Pau Gargallo 5, 08028 Barcelona, SPAIN

This work was supported by the Ministry of Education and Science of Spain (CAICYT) under contract PA85-0314.
ABSTRACT

The solution of the Algebraic Path Problem (APP) for arbitrarily sized graphs by a fixed-size systolic array processor (SAP) is addressed. The APP is decomposed into two subproblems, and a SAP is designed for each one. Both SAPs combined produce a highly implementable, versatile SAP (VSAP). The proposed VSAP has p×p processing elements (PEs), solving the APP of an N-vertex graph in N^3/p^2 + N^2/p + 3p - 2 cycles. With slight modifications in the operations performed by the PEs, the problem is optimally solved in N^3/p^2 + 3p - 2 cycles.
1. INTRODUCTION

Important problems such as the Transitive Closure (TC), the All-Pairs Shortest Path (SP), and the Gauss-Jordan Elimination are particular cases of the more general APP. Its solution is a computing-intensive task: it has cubic complexity with respect to the size of the problem.

Using (N+1)^2 PEs in a hexagonally connected array, the APP is solved in 7N cycles in [1]. An array that solves the APP in 5N cycles with N×(N+1) PEs can be found in [2]. In [3], a dependence-graph-based design method obtains new and already proposed arrays for the APP. There, optimum arrays obtaining the result in 5N cycles with N×N PEs are proposed. In fact, many SAPs devoted to special cases of the APP can be generalized for solving it, for example, SAPs for the TC and the SP [4], [5], [6], [7].

There is always a computation too large for a given array. Mapping larger-sized computations into smaller arrays is of great practical interest. The original problem is decomposed into subproblems whose sizes fit the available SAP size. This paper is devoted to decomposing the APP and to designing an array that solves the subproblems and combines their partial results. APP instances like the TC have been partitioned [8], [9], but to our knowledge there are no works about the partitioning of the APP in the literature.
In the next section the basics of the APP are briefly reviewed. Section 3 describes an iterative block algorithm for the APP. Then, in section 4, this algorithm is rearranged to solve the APP using only two matrix operations, i.e., the APP is decomposed into two subproblems. The design of a SAP for each subproblem, and their combination to form the resulting VSAP, is commented on in section 5. Block I/O details, changes for achieving an optimal execution, and performance expressions are given in section 6. Finally, in section 7 the outstanding points addressed in the paper are discussed.
2. THE ALGEBRAIC PATH PROBLEM

Consider a weighted directed graph G = <V,E,w>, where V is its N-vertex set, E ⊆ V×V its edge set, and w: E→S an edge weighting function whose values are taken from the set S. S belongs to a path algebra <S,+,×,*,0,1> together with two binary operations, "addition" +: S×S→S, and "multiplication" ×: S×S→S, and a unary operation called closure ()*: S→S. Constants 0 and 1 belong to S. Hereby, + and × will be used in infix notation, whereas the closure of an element a will be noted as a*. Additionally, * will have precedence over ×, and this over +.
This algebra is a closed semiring, i.e., an algebra fulfilling the following axioms [10]: + is associative and commutative, with 0 as neutral; × is associative, with 1 as neutral, and distributes over +; element 0 is absorptive with respect to ×; and the equality a* = 1 + a × a* = 1 + a* × a must hold.

CH2603-9/88/0000/0265 $01.00 © 1988 IEEE
The weight w(p) of a path p = (e1, e2, ..., em), ei ∈ E, is defined as

w(p) := w(e1) × w(e2) × ... × w(em).

The APP is the determination of the sum of the weights of all the possible paths between each pair of vertices (i,j). If P(i,j) is the set of all the possible paths from i to j, the APP is to find the values [11]:

d(i,j) := Σ_{p ∈ P(i,j)} w(p)

International Conference on Systolic Arrays
We associate a weight matrix A = {a(i,j)}, 1 ≤ i,j ≤ N, with graph G, where a(i,j) := w((i,j)) if (i,j) ∈ E, and a(i,j) := 0 otherwise. This matrix representation permits us to formulate the APP as the computation of a sequence of matrices A(k) = {a(i,j)(k)}, 0 ≤ k ≤ N. The value a(i,j)(k) is the weight of all the possible paths from vertex i to j with intermediate vertices v, 1 ≤ v ≤ k. Initially, A(0) := A; then A* := A(N), where A* := {a(i,j)*} is an N×N matrix satisfying a(i,j)* = d(i,j) if P(i,j) is not empty, and a(i,j)* := 0 otherwise.
The algorithms proposed to solve path problems formulated in matrix terms are known as matrix methods; most are referenced in [11]. Among them we find algorithms for the TC [12], [13]; the SP [14]; and the classic Gauss and Gauss-Jordan eliminations to compute the inverse of a matrix.

It was observed that the same program schemes could solve these problems. A program scheme is a program with fixed control, but where the sets over which the variables take their values, and the meaning of the algebraic operations, are left uninterpreted [10]. The majority of the program schemes useful for the APP come from Linear Algebra, like the Jacobi and Gauss-Seidel methods, and the mentioned Gauss and Jordan eliminations [15]. The Gauss-Jordan elimination has been recently introduced for solving the APP [1]. The array described in [2] is based on it.
There are many interesting problems that are specializations of the APP [15]. If the algebra is boolean, S = {0,1}, "+" = OR, "×" = AND, 0* = 1, 1* = 1, "0" = 0, "1" = 1; the APP finds the TC. For the SP: S = [0,+∞) ∪ {+∞}, "+" = min, "×" = +, "0" = +∞, "1" = 0; closure is the constant operation 0. For matrix inversion: S = R, "+" and "×" are the ordinary addition and multiplication with respective neutrals 0 and 1. The closure is a* = 1/(1-a) for a ≠ 1.
3. APP BLOCK DECOMPOSITION

Block algorithms are a useful tool for extracting parallelism from problems. For example, when solving matrix problems, both input and output matrices can be split into blocks or submatrices. Different block subproblems can be allocated to different processors. Processors must combine their partial results in order to attain the desired global solution. This can be named interblock parallelism.

Block algorithms are also used for solving problems in parallel using systolic arrays. However, the approach is completely different. In this case, blocks must fit the size of the available array. Subproblems are solved one after the other, with a proper sequencing to combine their partial solutions efficiently. By doing so, the original problem is solved. In the context of systolic computing, these are known as size-independent algorithms, and problem decomposition is named partitioning [16], [17].
Systolic arrays exploit intrablock parallelism. An array can be seen as a single processor running a sequential algorithm, but operating on blocks instead of elements. Normally, problem partitioning generates subproblems of different nature. For example, the LU-decomposition is split into smaller matrix products with accumulation, linear systems of equations, and LU-decompositions [18]; the TC requires solving smaller TCs and performing boolean matrix multiplications with accumulation [9]. We are interested in solving all the resulting subproblems using one single array.

Suppose that, in the process of solving the APP by obtaining the matrix sequence A(0), ..., A(N), matrix A(k) has already been computed. Also assume we know how to solve the APP for matrices with p×p elements or smaller. The point is how to obtain A(k+p) through block operations. The involved blocks are shown in figure 1. Block A1 is a p×p block on the diagonal. Note that A2 and A3 have p×(N-p) and (N-p)×p non-zero elements, respectively, while A4 has (N-p)×(N-p). Then, Theorem 1 indicates how to obtain A(k+p) from A(k) through block operations.
Operator ()* represents the APP of a block; + and × are the natural extensions to matrices of the addition and multiplication of the underlying algebra. Figure 2 illustrates a block operation.

Figure 1. The blocks referenced in Theorem 1.
Figure 2. A block operation illustrated.
Theorem 1: Under the preceding assumptions, A(k+p) can be obtained from A(k) as follows:

A1(k+p) := A1(k)*                                  (3.a)
A2(k+p) := A1(k)* × A2(k)                          (3.b)
A3(k+p) := A3(k) × A1(k)*                          (3.c)
A4(k+p) := A4(k) + A3(k) × A1(k)* × A2(k)          (3.d)
Proof: A k-path in G is defined as a path whose intermediate vertices must belong to the set {1..k}. With each matrix A(k) we can associate a graph G(k) = <V,E(k),w(k)> whose edge weights are the sums of all the possible k-paths in G between each ordered pair of vertices. A (k+p)-path in G is either a k-path (an edge in G(k)) or a path in G(k) whose intermediate vertices are taken from the set {k+1..k+p}. An example of the latter path is depicted in figure 3-a. If i,j ∈ {k+1..k+p}, then (3.a) follows trivially. Consider now (3.d); then i,j ∈ {1..N}-{k+1..k+p}. The general form of this path is represented in figure 3-b. Note that a(v1,vm)(k+p), an element of A1(k)*, is the weight of all the (k+p)-paths from v1 to vm. Thus, the weight of this path from i to j is a(i,v1)(k) × a(v1,vm)(k+p) × a(vm,j)(k). When the vertices range over the specified sets, the overall operations can be expressed in matrix terms as (3.d). The term A4(k) adds the contributions of the k-paths. Cases (3.b) and (3.c) are simplifications of (3.d). □
It is worth mentioning that block schemes for the APP can also be derived by extending to blocks the properties fulfilled by the elements, and then applying any valid program scheme to blocks. Nevertheless, we have decided to present Theorem 1 because it is more general: it permits designing parallel algorithms using different block sizes for the same problem. In addition, by employing graph instead of algebraic terms, it provides insight into the effect of the underlying operations.
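As an illustration of Theorem 1 (ours, not part of the paper), the following sketch applies equations (3.a)-(3.d) to one diagonal block of a min-plus (shortest path) weight matrix; the helper names mp_mul, mp_add and mp_closure are our own:

```python
# Illustrative sketch of one Theorem 1 step in the min-plus algebra.
INF = float("inf")

def mp_mul(X, Y):
    """Matrix "multiplication": (X x Y)[i][j] = min_t X[i][t] + Y[t][j]."""
    return [[min(X[i][t] + Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def mp_add(X, Y):
    """Matrix "addition": elementwise min."""
    return [[min(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(X, Y)]

def mp_closure(X):
    """APP of a block: Floyd-Warshall with a 0 ("1") diagonal."""
    n = len(X)
    C = [[0.0 if i == j else X[i][j] for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                C[i][j] = min(C[i][j], C[i][k] + C[k][j])
    return C

def theorem1_step(A, k, p):
    """Obtain A(k+p) from A(k); A1 is the p x p diagonal block at offset k."""
    n = len(A)
    d = list(range(k, k + p))                # indices of A1
    r = [i for i in range(n) if i not in d]  # the remaining indices
    A1 = [[A[i][j] for j in d] for i in d]
    A2 = [[A[i][j] for j in r] for i in d]
    A3 = [[A[i][j] for j in d] for i in r]
    A4 = [[A[i][j] for j in r] for i in r]
    A1s = mp_closure(A1)                               # (3.a)
    A2n = mp_mul(A1s, A2)                              # (3.b)
    A3n = mp_mul(A3, A1s)                              # (3.c)
    A4n = mp_add(A4, mp_mul(mp_mul(A3, A1s), A2))      # (3.d)
    B = [[None] * n for _ in range(n)]
    for a, i in enumerate(d):
        for b, j in enumerate(d): B[i][j] = A1s[a][b]
        for b, j in enumerate(r): B[i][j] = A2n[a][b]
    for a, i in enumerate(r):
        for b, j in enumerate(d): B[i][j] = A3n[a][b]
        for b, j in enumerate(r): B[i][j] = A4n[a][b]
    return B
```

Applying the step for k = 0, p, 2p, ..., N-p reproduces A*, which is how the sketch can be checked against a plain Floyd-Warshall closure of the whole matrix.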
4. TWO-PRIMITIVE APP PARTITIONING

As said, all subproblems must be solved by a single array. In this paper we are concerned with 2-D SAPs, although the results also apply to one-dimensional machines. For the moment, let us advance that the fixed-size array developed in the next section has p×p processing elements (PEs).

Figure 3. (a) A (k+p)-path in G(k) with intermediate vertices in {k+1..k+p}; (b) a general form of these paths: a(i,v1)(k) × a(v1,vm)(k+p) × a(vm,j)(k), with i,j ∈ {1..N}-{k+1..k+p} and v1,...,vm ∈ {k+1..k+p}.
Hence, it is natural to partition the N×N matrices to be computed into square blocks with p×p elements. Without loss of generality, assume p divides N evenly. The remainder of this section will deal with blocks, or groups of them. Some matrix notation conventions are mandatory.

Suppose an N×N matrix A that is split into N/p × N/p blocks of size p×p. A(i,j) is one of these blocks, placed at the i-th block-row and j-th block-column. The i-th block-row and the j-th block-column are respectively denoted A(i,&) and A(&,j). A(i,&-j) refers to the i-th block-row without block A(i,j). Under this convention, A(&-i,&-j) is obtained from A by eliminating its i-th block-row and j-th block-column. Subranges also need to be specified. For instance, A(i,j..k) is composed of the blocks in the i-th block-row from block-column j to block-column k.
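As an aside (ours, not the paper's), the indexing conventions can be mimicked with two small helpers; block indices are 1-based as in the text:

```python
# Illustrative helpers for the block-indexing conventions: an N x N
# matrix split into N/p x N/p blocks of p x p elements (1-based indices).
def block(A, p, i, j):
    """A(i,j): the block at block-row i, block-column j."""
    r, c = (i - 1) * p, (j - 1) * p
    return [row[c:c + p] for row in A[r:r + p]]

def block_row(A, p, i, minus=None):
    """A(i,&): the i-th block-row; with minus=j it is A(i,&-j)."""
    nb = len(A) // p
    cols = [j for j in range(1, nb + 1) if j != minus]
    out = []
    for r in range(p):
        out.append([x for j in cols for x in block(A, p, i, j)[r]])
    return out
```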
Now we will present Algorithm 1, which computes the APP of an N×N matrix from the basic assumption that it is known how to obtain the APP of a p×p block. Algorithm 1 is a direct outcome of Theorem 1. It is row-oriented. A column-oriented version could be obtained by simply interchanging block indexes and inverting the order in which blocks are multiplied.
Algorithm 1
B(0) := A
for k := 1 to N/p
    B(k,k)(k) := B(k,k)(k-1)*                                      (4.a)
    B(k,&-k)(k) := B(k,k)(k) × B(k,&-k)(k-1)                       (4.b)
    B(&-k,&)(k) := B(&-k,&-k)(k-1) + B(&-k,k)(k-1) × B(k,&)(k)     (4.c)
end for
A* := B(N/p)

Note that A(kp) = B(k); thus, after N/p iterations, A* = A(N) = B(N/p) results.
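For concreteness, here is an illustrative rendering of Algorithm 1 in the min-plus algebra (our sketch, not the paper's notation; block B(i,j) is stored as B[(i, j)]). Since the min-plus "+" is idempotent, the update (4.c) can be applied uniformly to every block-column, including column k, where it reduces to (3.c):

```python
# Illustrative min-plus rendering of Algorithm 1; all helper names ours.
INF = float("inf")

def mp_mul(X, Y):
    """Min-plus block product: (X x Y)[i][j] = min_t X[i][t] + Y[t][j]."""
    return [[min(X[i][t] + Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def mp_add(X, Y):
    """Min-plus block addition: elementwise min."""
    return [[min(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(X, Y)]

def mp_star(X):
    """The APP of one p x p block: Floyd-Warshall with a 0 ("1") diagonal."""
    p = len(X)
    C = [[0.0 if i == j else X[i][j] for j in range(p)] for i in range(p)]
    for k in range(p):
        for i in range(p):
            for j in range(p):
                C[i][j] = min(C[i][j], C[i][k] + C[k][j])
    return C

def app_block(A, p):
    """Algorithm 1: A* of an N x N min-plus matrix (p divides N)."""
    nb = len(A) // p
    B = {(i, j): [row[(j - 1) * p:j * p] for row in A[(i - 1) * p:i * p]]
         for i in range(1, nb + 1) for j in range(1, nb + 1)}
    for k in range(1, nb + 1):
        B[k, k] = mp_star(B[k, k])                          # (4.a)
        for j in range(1, nb + 1):
            if j != k:
                B[k, j] = mp_mul(B[k, k], B[k, j])          # (4.b)
        for i in range(1, nb + 1):
            if i == k:
                continue
            old = B[i, k]                                   # B(i,k)(k-1)
            for j in range(1, nb + 1):
                # (4.c); for j = k the idempotent "+" makes the extra
                # old term harmless, so the result matches (3.c)
                B[i, j] = mp_add(B[i, j], mp_mul(old, B[k, j]))
    return [[B[i, j][r][c] for j in range(1, nb + 1) for c in range(p)]
            for i in range(1, nb + 1) for r in range(p)]
```

The result can be checked against a plain Floyd-Warshall closure of the whole matrix, since both compute A* in this algebra.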
At this point, we have seen that the APP is decomposable into smaller APPs and matrix multiplications (with or without accumulation), observing the rules of the underlying algebra. The next stage is to modify Algorithm 1 to make it more suitable for systolization. It is interesting to minimize the number of primitives executable by the VSAP. Only two primitives, named P1 and P2, suffice for the APP. Let us define them. Consider three matrices X, Y, Z with sizes p×p, p×m, and p×m respectively. We define P1 and P2 as: P1(X,Y) := X* × Y; and P2(X,Y,Z) := X × Y + Z.
Algorithm 2 solves the APP using only P1 and P2.

Algorithm 2
B(0) := A
for k := 1 to N/p
    B(k,&)(k) := P1(B(k,k)(k-1), B(k,1..k-1)(k-1) | Ip | B(k,k+1..N)(k-1))    (4.d)
    for i := 1 to N/p with i ≠ k
        B(i,&)(k) := P2(B(i,k)(k-1), B(k,&)(k), B(i,&-k)(k-1))                (4.e)
    end for
end for
A* := B(N/p)

Lines (4.a) and (4.b) obtain B(k,&)(k). Using P1, they can be grouped into line (4.d), where the symbol | denotes matrix juxtaposition, and Ip is the p×p identity matrix. On the other hand, line (4.c) is equivalent to B(i,&)(k) := B(i,k)(k-1) × B(k,&)(k) + B(i,&-k)(k-1) for i ≠ k, allowing the use of P2 for computing B(i,&)(k) in line (4.e).
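The two-primitive formulation can likewise be sketched in the min-plus algebra (our illustration, not the paper's implementation): the juxtaposition | becomes row-wise list concatenation, and Ip is the min-plus identity block (0 diagonal, +inf elsewhere):

```python
# Illustrative min-plus rendering of Algorithm 2: every block update
# goes through P1 or P2 only. All helper names are ours.
INF = float("inf")

def mp_mul(X, Y):
    return [[min(X[i][t] + Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def mp_star(X):
    """Closure of a p x p block (Floyd-Warshall, 0 diagonal)."""
    p = len(X)
    C = [[0.0 if i == j else X[i][j] for j in range(p)] for i in range(p)]
    for k in range(p):
        for i in range(p):
            for j in range(p):
                C[i][j] = min(C[i][j], C[i][k] + C[k][j])
    return C

def P1(X, Y):
    """P1(X, Y) = X* x Y."""
    return mp_mul(mp_star(X), Y)

def P2(X, Y, Z):
    """P2(X, Y, Z) = X x Y + Z ("+" is elementwise min)."""
    XY = mp_mul(X, Y)
    return [[min(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(XY, Z)]

def hstack(blocks):
    """Juxtapose p x p blocks side by side (the paper's "|")."""
    return [sum((blk[r] for blk in blocks), []) for r in range(len(blocks[0]))]

def app_two_primitives(A, p):
    """Algorithm 2: A* computed using only P1 and P2."""
    nb = len(A) // p
    B = {(i, j): [row[(j - 1) * p:j * p] for row in A[(i - 1) * p:i * p]]
         for i in range(1, nb + 1) for j in range(1, nb + 1)}
    Ip = [[0.0 if i == j else INF for j in range(p)] for i in range(p)]
    for k in range(1, nb + 1):
        # (4.d): B(k,&)(k) = P1(B(k,k), B(k,1..k-1) | Ip | B(k,k+1..N))
        Y = hstack([B[k, j] for j in range(1, k)] + [Ip]
                   + [B[k, j] for j in range(k + 1, nb + 1)])
        R = P1(B[k, k], Y)
        cols = list(range(1, k)) + [k] + list(range(k + 1, nb + 1))
        for idx, j in enumerate(cols):
            B[k, j] = [row[idx * p:(idx + 1) * p] for row in R]
        # (4.e): B(i,&)(k) = P2(B(i,k)(k-1), B(k,&)(k), B(i,&-k)(k-1))
        for i in range(1, nb + 1):
            if i == k:
                continue
            old = B[i, k]
            for j in range(1, nb + 1):
                # for j = k, min-plus idempotence lets the old block
                # stand in for Z without changing the result
                B[i, j] = P2(old, B[k, j], B[i, j] if j != k else old)
    return [[B[i, j][r][c] for j in range(1, nb + 1) for c in range(p)]
            for i in range(1, nb + 1) for r in range(p)]
```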
5. A VERSATILE SAP FOR THE APP

When designing versatile SAPs, the additional complexity caused by partitioning must be minimized in order to attain an economic and implementable machine. More specifically, it is mandatory to reduce the overhead caused by partitioning in the external hardware and communications, as well as in the required control. The additional delays and the PE complexity have to be minimized too [19].

Intuitively, the VSAP can be obtained by overlapping, in some sense, two SAPs: one for P1 and the other for P2. Moreover, both candidate SAPs must be as similar as possible. An efficient sequencing between consecutive subproblems also has a direct influence on performance. Most of these points are taken into account in papers dealing with the mapping of partitioned systolic algorithms onto fixed-size arrays.

In order to attain a highly implementable design, additional constraints could be observed. For example, by forcing equal input and output data formats, there is no need for a data rearranging network between the VSAP and the external memory modules. In other words, a single storage scheme can be used for all the matrices.

Another interesting restriction is to have memory modules only along one side of the SAP's polygon. This avoids having different algorithms to perform the same operation over matrices entering the array from different points.

Algorithm 2 has been designed to use the same block in a (maybe long) sequence of operations. Hence, in order to be used, this block is preloaded in the VSAP. This helps to reduce the internal and the external I/O bandwidth requirements. All these constraints have influenced the design of the VSAP.
The Jordan algorithm for the quasi-inversion of a matrix can be extended to compute P1 [15]. The following equation must hold: A* = A × A* + Ip; then, by postmultiplying both sides by B, we can write: P1(A,B) = A × P1(A,B) + B. In the latter equation P1(A,B) is the unknown. The Jordan algorithm finds it by computing a sequence of matrices A(k), B(k), and M(k), 1 ≤ k ≤ p, with respective dimensions p×p, p×m, and p×p. Initially A(0) := {a(i,j)(0)} = A, and B(0) := {b(i,j)(0)} = B; then

A(k) := M(k) × A(k-1);  B(k) := M(k) × B(k-1);  1 ≤ k ≤ p.

M(k) is obtained from the k-th column of A(k-1). Figure 4 shows its structure. It can be shown that B(p) = A* × B = P1(A,B).

Figure 4. The structure of matrix M(k).
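The recurrence can be sketched in the min-plus algebra (our illustration, not the paper's code). We assume, as the natural reading of "obtained from the k-th column of A(k-1)", that M(k) equals the identity except in its k-th column, where M(k)[k][k] = a(k,k)(k-1)* and M(k)[i][k] = a(i,k)(k-1) × a(k,k)(k-1)* for i ≠ k:

```python
# Illustrative min-plus sketch of the Jordan computation of P1(A, B).
# Assumption (ours): M(k) is the identity except in column k, built
# from the k-th column of A(k-1); in min-plus, a(k,k)* = 0 = "1".
INF = float("inf")

def mp_mul(X, Y):
    """Min-plus product: (X x Y)[i][j] = min_t X[i][t] + Y[t][j]."""
    return [[min(X[i][t] + Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def jordan_P1(A, B):
    """Compute P1(A, B) = A* x B via A(k) = M(k) x A(k-1), B(k) = M(k) x B(k-1)."""
    p = len(A)
    Ak = [row[:] for row in A]
    Bk = [row[:] for row in B]
    for k in range(p):
        # M(k): identity (0 diagonal, +inf elsewhere) except column k.
        M = [[0.0 if i == j else INF for j in range(p)] for i in range(p)]
        star = 0.0                      # a(k,k)* is the constant 0 in min-plus
        M[k][k] = star
        for i in range(p):
            if i != k:
                M[i][k] = Ak[i][k] + star   # a(i,k)(k-1) x a(k,k)*
        Ak, Bk = mp_mul(M, Ak), mp_mul(M, Bk)
    return Bk                           # B(p) = A* x B = P1(A, B)
```

A quick way to check the sketch is to compare its output against a Floyd-Warshall closure of A multiplied by B.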

References

- Robert W. Floyd, "Algorithm 97: Shortest Path," Communications of the ACM, June 1962.
- Stephen Warshall, "A Theorem on Boolean Matrices," Journal of the ACM, 1962.
- Sun-Yuan Kung, "VLSI Array Processors," 1985.
- Michel Gondran and Michel Minoux, Graphs and Algorithms (book).