# Error Detection in Arrays via Dependency Graphs* 

EDWIN HSING-MEAN SHA AND KENNETH STEIGLITZ<br>Department of Computer Science, Princeton University, Princeton, NJ 08544

Received August 9, 1991; Revised December 3, 1991.


#### Abstract

This paper describes a methodology based on dependency graphs for doing concurrent run-time error detection in systolic arrays and wavefront processors. It combines the projection method of deriving systolic arrays from dependency graphs with the idea of input-triggered testing. We call the method ITRED, for Input-driven Time-Redundancy Error Detection. Tests are triggered by inserting special symbols in the input, and so the approach gives the user flexibility in trading off throughput for error coverage. Correctness of timing is proved at the dependency graph level. The method requires no extra PEs and little extra hardware. We propose several variations of the general approach and derive corresponding constraints on the modified dependency graphs that guarantee correctness. One variation performs run-time error correction using majority voting. Examples are given, including a dynamic programming algorithm, convolution, and matrix multiplication.


## 1. Introduction

Reliability is often a critical issue in applications of high-performance systolic or wavefront array processors, and for that reason much recent work has addressed the problems of on-line error detection (see, for example, [1]). We consider in this paper a flexible and general methodology for incorporating error detection in array design.

The two general approaches pursued in the literature for error detection are hardware and time redundancy. That is, one can detect errors by introducing additional computing hardware, perhaps duplicating PEs, or one can do duplicate computations using the same hardware. In general, there is a tradeoff between the decrease in throughput caused by the time redundancy, and the cost of the extra hardware used for hardware redundancy. A high degree of time redundancy can achieve good error detection, but at the cost of decreased throughput; a high degree of hardware redundancy can do the same without the attendant decrease in throughput, but at the cost of more hardware.

Much previous work takes advantage of the regularity of systolic arrays. For example, [1] describes algorithm-based techniques that are especially suited to systolic arrays, but these are applicable only to a subset of linear systems, and it is unclear how to use them on problems like the substring comparison we consider in Section 2. The work in [2], [3] uses dual-module redundancy to detect errors; the essentially time-redundant technique of [4] applies only to unilateral linear arrays and results in a slowdown by a factor of two; [5] also deals with special classes of systolic arrays and again halves the throughput rate using time redundancy. The method of roving spares described in [6] uses limited hardware redundancy, but it is not clear how to extend the method to bilateral arrays or more complicated structures.

The idea of using tokens to trigger error detection appears to have been introduced in [7]. The authors use both time and space redundancy, and a fixed periodic pattern of inserting tokens. In the case of unilateral linear arrays, the number of inserted tokens in the array at any instant cannot exceed the number of extra PEs. Thus, the frequency of token insertion is predetermined by the number of extra PEs. In the case of bilateral linear arrays, they make use of the idle PEs and idle cycles in the original computations for space and time redundancy, so only one extra PE is needed.

We will combine two ideas to achieve run-time error detection: First, as in [7], we introduce special symbols in the input that signal the processors to perform comparisons for the purposes of detecting discrepancies. Typically, this is done by having two (or more) adjacent processors perform the same computation and comparing results. In contrast with [7], however, the frequency of insertion of these special symbols is determined by
the user at run time, rather than being pre-determined by hardware constraints. Second, we introduce the special symbols at the level of the dependency graph, and follow the effect through the projections used to arrive at a systolic or wavefront array [8].

There are several advantages to this general approach over more specialized or ad hoc approaches. First, it allows the user to determine the frequency of error checking at run time. Thus more error checking can be done when a lower throughput is acceptable. A second advantage stems from the fact that the method is expressed in terms of the dependency graph. This allows us to use previous work [8] on scheduling and projection to prove the correctness of the resulting working architectures. A third advantage is that the approach requires no extra PEs, and little extra hardware.

In the next section we briefly describe dependency graphs using the problem of finding minimum substring-distance as an example. In Section 3 we describe the general methodology of ITRED. In Section 4 we discuss our fault model at the level of array nodes, that is, nodes in the signal flow graph that are mapped to the working architecture. The details of implementing ITRED for unilateral linear arrays, which include the minimum substring-distance problem and convolution, are discussed in Section 5. Section 6 then shows how to extend ITRED to more general problems, using matrix multiplication as an example. We prove correctness in Section 7. Finally, in Section 8 we show how ITRED can be adapted to handle some special design requirements.

## 2. Minimum Substring-Distance

In this section, we introduce as a working example the problem of finding minimum substring-distance. We use this problem to illustrate the dependency graph $D G$ and the mapping method for transforming a $D G$ to an array architecture [8]. String comparison is a time-consuming and important operation in many applications, such as information retrieval, databases, artificial intelligence, pattern recognition, and DNA pattern matching.

The edit distance between two strings is the minimum number of basic operations (insertion, deletion and substitution) necessary to transform one string to the other. For example, chao can be transformed to sha by a sequence of three operations as follows:

```
chao (delete c) --> hao (delete o) --> ha (insert s) --> sha.
```

But two transformations suffice:

```
chao (substitute s for c) --> shao (delete o) --> sha.
```

In fact this is minimum, so the edit distance between the two strings is two.
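As an illustrative sketch (not part of the original presentation), the edit distance just defined can be computed with the standard row-by-row dynamic program. The function name `edit_distance` is our own choice, and unit cost is assumed for each insertion, deletion, and substitution.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distance from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                           # delete all i characters of a's prefix
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


print(edit_distance("chao", "sha"))
```

For the pair of strings used above, this returns two, in agreement with the two-operation transformation.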

Systolic arrays for computing edit distance between two strings have been described in [9]-[11]. In [12], Landau and Vishkin consider the problem of finding a substring of a string $S$ most similar to a given pattern $P$. Given string $S$ and pattern $P$, let $S(i: j)$ be the substring of $S$ from position $i$ to position $j$ and let $\operatorname{dis}(S(i: j), P)$ be the edit distance between $S(i: j)$ and $P$. The minimum substring-distance is the minimum distance $\operatorname{dis}(S(i: j), P)$, where $i$ and $j$ range from 1 to the length of $S$. Thus, the minimum substring-distance between the string "I like Systolic VLSI arrays," and "Systolic arrays" is five.

The problem of minimum substring-distance can be solved by two-dimensional dynamic programming, which in turn can be implemented by a one-dimensional systolic array.

An input instance of the problem is

$$
\begin{aligned}
& S=s_{1} s_{2} \ldots s_{n}: \text { a (long) string } \\
& P=p_{1} p_{2} \ldots p_{m}: \text { a (short) string }
\end{aligned}
$$

The output of the problem is the minimum of all edit distances of substrings $S(i-k: i)=s_{i-k} s_{i-k+1} \ldots s_{i}$ from the pattern $P$, where $1 \leq i \leq n, 0 \leq k \leq i-1$.

The dynamic programming algorithm proceeds as follows. Let $D[i, j]$ denote the minimum distance of all substrings ending at $s_{i}$ from the prefix $P(1: j)$, where $1 \leq i \leq n, 1 \leq j \leq m$. Initially,

$$
\begin{array}{ll}
D[i, 0]=0 & \text { for every } i \text { and } \\
D[0, j]=j & \text { for every } j .
\end{array}
$$

If we think of the $D[i, j]$ as being in a two-dimensional array, each $D[i, j]$ can be computed from the entries above, to the left, and above and to the left, as follows:

$$
\begin{aligned}
& \text { for } i=1 \text { to } n \text { do } \\
& \quad \text { for } j=1 \text { to } m \text { do } \\
& \qquad D[i, j]=\min \bigl(D[i-1, j]+1,\; D[i, j-1]+1, \\
& \qquad \qquad D[i-1, j-1] \text { if } s_{i}=p_{j}, \text { or } D[i-1, j-1]+1 \text { otherwise }\bigr)
\end{aligned}
$$

When this double loop is completed, the entries $D[i, m]$ contain the minimum distance of all substrings ending at $s_{i}$ from the pattern $P$. If we consider each $\min$ operation as a node and represent each dependence of an operation on data as a directed edge between two nodes, the resulting dependency graph $D G$ is as shown in figure 1. The graph $D G$ is acyclic and therefore computable.

Fig. 1. Dependency graph for minimum substring-distance.
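The double loop above can be sketched directly in code. This is a minimal sequential version, assuming unit cost for a substitution and using a function name of our own (`min_substring_distance`); it keeps only one row of $D$ at a time and is not the array implementation discussed later.

```python
def min_substring_distance(S, P):
    """Minimum edit distance between the pattern P and any substring of S."""
    m = len(P)
    prev = list(range(m + 1))      # D[0, j] = j
    best = prev[m]                 # the empty substring costs m insertions
    for s in S:
        cur = [0]                  # D[i, 0] = 0: a substring may start anywhere
        for j in range(1, m + 1):
            cur.append(min(prev[j] + 1,                      # D[i-1, j] + 1
                           cur[j - 1] + 1,                   # D[i, j-1] + 1
                           prev[j - 1] + (s != P[j - 1])))   # diagonal term
        best = min(best, cur[m])   # D[i, m]: best substring ending at s_i
        prev = cur
    return best


print(min_substring_distance("I like Systolic VLSI arrays,", "Systolic arrays"))
```

Each evaluation of `min` in the inner loop corresponds to one node of the dependency graph, and its three arguments correspond to the three incoming edges.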

We call a node in $D G$ a computation cell, or cell. As described in [8], the two design steps of processor assignment and scheduling can be used to map such a $D G$ to a lower dimensional signal flow graph $S F G$. We call a node of the signal flow graph a Processor Element (PE), this being justified because the signal flow graph is very close to a hardware specification for a SIMD systolic or wavefront array. Let an equiprocessor curve be a curve containing all the cells of the dependency graph that are projected onto one PE of the signal flow graph of lower dimension, and let an equitemporal surface be a surface containing all the computation cells that are active at a given time.

Usually, the equiprocessor curves are parallel straight lines, in which case we let $\vec{p}$ be a vector parallel to the equiprocessor lines, called the projection vector. Further, it is often the case that the dependency graph has a linear schedule; that is, all equitemporal surfaces are parallel hyperplanes, and so have a unique normal direction. Let $\vec{s}$ be a vector in this normal direction, called the schedule vector.

Kung [8] showed that given a projection vector $\vec{p}$, necessary and sufficient conditions for a linear schedule to be permissible, that is, represent a realizable computation in the signal flow graph, are the following:

$$
\begin{aligned}
& \text { (1) } \forall \text { edge } \vec{e} \text { in } D G, \vec{s}^{T} \vec{e} \geq 0 \text {. } \\
& \text { (2) } \vec{s}^{T} \vec{p}>0 \text {. }
\end{aligned}
$$

In our example of the minimum substring-distance problem, we can choose the projection vector $\vec{p}=(1,0)$
and the permissible linear schedule $\vec{s}=(1,1)$, as shown in figure 1. This leads to a signal flow graph with $m$ processors, where $m$ is the size of the pattern $P$, and that is reasonable since $n$, the size of the string $S$, is usually very much larger than $m$.
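The two permissibility conditions are easy to check mechanically. The sketch below assumes that cell $(i, j)$ depends on $(i-1, j)$, $(i, j-1)$ and $(i-1, j-1)$, so that the edge vectors are $(1,0)$, $(0,1)$ and $(1,1)$; the helper names `is_permissible` and `place` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def is_permissible(s, p, edges):
    """Conditions (1) and (2): s.e >= 0 for every edge e, and s.p > 0."""
    return all(dot(s, e) >= 0 for e in edges) and dot(s, p) > 0


# Edge vectors of the minimum substring-distance graph (assumed orientation).
EDGES = [(1, 0), (0, 1), (1, 1)]


def place(i, j):
    """Map cell (i, j) to (PE index, time step) for p = (1, 0) and s = (1, 1)."""
    return j, i + j


print(is_permissible((1, 1), (1, 0), EDGES))    # the schedule chosen in the text
print(is_permissible((1, -1), (1, 0), EDGES))   # violates condition (1)
print(place(3, 2))
```

Under this mapping all cells of column $j$ land on the same PE, which is why the resulting signal flow graph has $m$ processors.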

## 3. ITRED: General Approach

In this section we discuss ways of modifying dependency graphs to achieve error detection, and we will call a specific algorithm for doing so a strategy. The strategy determines the way in which special symbols are inserted in the input data stream. We propose two approaches. In the first, we derive some strategies that allow every $P E$ to be tested if the user chooses to provide the right inputs. In the second approach not only can every $P E$ be tested consecutively by choice of the input stream, but the computation results themselves can be produced by majority vote. We begin with the first approach, which is actually a special case of the second.

We use a special input symbol, called $\alpha$, which serves the purpose of informing a $P E$ to do error detection (as in [13]). When $P E_{i}$ receives an $\alpha$ symbol, $P E_{i}$ will do the same operation as $P E_{i-1}$ and compare its result with that of $P E_{i-1}$. (We assume here that $P E_{i}$ is in fact capable of performing the same operation as $P E_{i-1}$. If all processors are not identical, this requirement might require augmenting the capabilities of some of the processors.) If the results are not the same, an error has been detected. The user has the freedom to decide how frequently an $\alpha$ symbol is inserted in the original input. At one extreme, the user inserts no $\alpha$ symbols, in which case there is no decrease in throughput. At the other extreme, the user inserts an $\alpha$ symbol before each input data point in the original input stream, so the throughput becomes at most half the original speed. Thus, the tradeoff between speed and error coverage is under user control.
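The tradeoff can be made concrete with a small sketch. Here `ALPHA` is a hypothetical marker for the $\alpha$ symbol, `insert_alphas` is our own helper, and throughput is approximated by the ratio of data items to total stream length, ignoring the time needed to fill the pipeline.

```python
ALPHA = "<alpha>"  # hypothetical marker standing in for the alpha symbol


def insert_alphas(data, every=None):
    """Insert one alpha before every `every`-th data item (None: no testing)."""
    stream = []
    for index, item in enumerate(data):
        if every and index % every == 0:
            stream.append(ALPHA)
        stream.append(item)
    return stream


def relative_throughput(data, every=None):
    """Fraction of stream slots that carry real data."""
    return len(data) / len(insert_alphas(data, every))


data = list("abcdefgh")
print(relative_throughput(data))            # no alphas inserted
print(relative_throughput(data, every=4))   # one alpha per four data items
print(relative_throughput(data, every=1))   # one alpha before each data item
```

The last case corresponds to the extreme in which throughput is halved.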

Definition 3.1. We say a strategy for inserting $\alpha$ 's into the input stream is $\alpha$-successful if all PEs are tested at least once and all computation cells have the correct timing.

Actually, ITRED can be easily extended so that every computation cell is tested, but sometimes we may need to add extra PEs so the computation cells on the border can be tested.

We want to think of adding the $\alpha$ symbols into the original dependency graph; to do this we add special
cells called $\alpha$ cells. In the dependency graph, the effect of an $\alpha$ symbol is similar to a delay, since when $P E_{i}$ receives an $\alpha$ symbol, it will save its state, discard what it produces after it simulates $P E_{i-1}$ 's computation, and then restore its previous state.

For simplicity, we first consider the case of a two-dimensional dependency graph $G$ like the one in figure 1, with $m$ columns and $n$ rows. Without loss of generality, we assume that data for a particular problem instance enters along a row (row input), and flows from column to column. Let $g_{i, j}$ be a computation cell, where $1 \leq i \leq n$, and $1 \leq j \leq m$.

To insert an $\alpha$ symbol in the input stream that travels from $P E$ to $P E$, insert a complete row of $\alpha$ cells in the dependency graph, as shown in figure 2. If this row is inserted before row $i$, this splits $G$ into two parts, the part from row 1 to row $i-1$, and the part from row $i$ to the last row. Keep the edges that went from row $i-1$ to $i$ in the first part. Let $\vec{\alpha}$ be the vector normal to the added row, so $\vec{\alpha}$ is $(0,1)$. Note that in other, more general situations the inserted $\alpha$ symbols may not form a hyperplane, and therefore there may not be a well defined $\vec{\alpha}$ vector. We will see an example of this in a later section.

Let $\alpha^{j}, 1 \leq j \leq m$ be the row of added $\alpha$ cells, ordered in the direction of increasing time. If column $j$ is projected to $P E_{j}$, add the directed edge ( $\alpha^{j}, g_{i, j}$ ). Call these edges delay edges and denote by $c^{j}$ the computation cell pointed to by the delay edge leaving $\alpha^{j}$. Since $\alpha^{j}$ and $c^{j}$ project to the same $P E$, the difference between their coordinate vectors is a vector parallel to $\vec{p}$. Figure 1 shows the original dependency graph for the minimum substring-distance problem and figure 2 shows the dependency graph modified in the way just discussed.

An $\alpha$ stream inserted into the dependency graph in this way can be regarded as a surface, which we call an $\alpha$-surface. When the $\alpha$-surface is a hyperplane, we can call it an $\alpha$-hyperplane. We say that an $\alpha$-surface is a cutting surface if removing it separates the dependency graph into disconnected pieces. We say that a cutting surface is unicutting if all the edges crossing this surface cross it in the same direction. Cutting or unicutting hyperplanes are defined analogously.

We next derive constraints on the way in which the original dependency graph should be modified so that testing takes place correctly. We prove later that these conditions are sufficient to ensure that a strategy is $\alpha$-successful. Observe first that since we need to test every $P E$, the vector $\vec{\alpha}$ cannot be perpendicular to the vector $\vec{p}$, and in fact every $P E$ should be the image under projection of at least one $\alpha$ cell. Furthermore, because we do not intend to increase the number of $P E$s, we also require that each $P E$ be the image under projection of at least one computation cell.

Fig. 2. Modified dependency graph for minimum substring-distance.

We know that different PEs should be tested at different times, so the vector $\vec{\alpha}$ cannot be parallel to the vector $\vec{s}$. (When the working architecture is a wavefront array, this sequential property of the testing will be naturally ensured by the fact that the testing is data-driven.) Since each $\alpha^{j}$ is basically a delay for some later operation $c^{j}$ by the same $P E$, the delay edge should be in the same direction as the vector $\vec{p}$.

Let $P E^{j}$ be the $P E$ to which $\alpha^{j}$ is projected. We know that whenever a $P E$ receives an $\alpha$, this $P E$ needs to do the same operation as its neighboring $P E$ will do. Thus, for each $\alpha^{j}$ there should exist a computation cell (not an $\alpha$ cell) that is projected to $P E^{j}$ 's neighbor at the same time that the $\alpha$ cell is projected to $P E^{j}$. We summarize the constraints discussed above in the following, which we call the $\Sigma$ constraints for hyperplanes.

$\Sigma$ constraints for hyperplanes:

0. $\vec{\alpha}$ is not parallel to $\vec{s}$
1. $\exists$ an $\alpha$ cell on the border at which data arrives
2. all delay edges are parallel to $\vec{p}$
3. $\forall P E, \exists$ an $\alpha$ cell which is projected to $P E$
4. $\forall P E, \exists$ a computation cell which is projected to $P E$
5. $\forall \alpha^{j}, \exists$ a non-$\alpha$ computation cell that is in the same equitemporal hyperplane as $\alpha^{j}$ and is projected to a neighboring $P E$ of $P E^{j}$
6. The $\alpha$-hyperplane is unicutting

As noted above the zeroth constraint is not needed at all when the working architecture is a wavefront array, so we assume without loss of generality that the working architecture is a synchronous, systolic array, rather than a wavefront array. Actually, the zeroth constraint is implied by the fifth constraint, so it is redundant and can be omitted. If the equitemporal surface or the $\alpha$ surface is not a hyperplane, we can generalize the above constraints easily as follows:

$\Sigma$ constraints:

1. $\exists$ an $\alpha$ cell on the border at which data arrives
2. all delay edges are parallel to $\vec{p}$
3. $\forall P E, \exists$ an $\alpha$ cell which is projected to $P E$
4. $\forall P E, \exists$ a computation cell which is projected to $P E$
5. $\forall \alpha^{j}, \exists$ a non-$\alpha$ computation cell that is in the same equitemporal surface as $\alpha^{j}$ and is projected to a neighboring $P E$ of $P E^{j}$
6. The $\alpha$-hyperplane is unicutting

If the projection, schedule, and modified dependency graph satisfy the above constraints, we say that this dependency graph is correctly modified. We leave for Section 7 a proof that a correctly modified dependency graph is $\alpha$-successful.
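Some of these constraints can be checked mechanically on a small instance. The following sketch covers only constraints 3, 4 and 5 for the two-dimensional case with a single inserted $\alpha$ row, assuming that column $j$ is projected to $P E_{j}$ and that a cell in row $r$ of the modified graph and column $j$ is scheduled at time $r+j$. The names `modified_graph` and `check_constraints` are hypothetical.

```python
def modified_graph(n, m, before_row):
    """Cells of an n-by-m graph with one alpha row inserted before `before_row`.

    Keys are (row in the modified graph, column); values are "alpha" or the
    original coordinates (i, j) of a computation cell.
    """
    cells, r = {}, 1
    for i in range(1, n + 1):
        if i == before_row:
            for j in range(1, m + 1):
                cells[(r, j)] = "alpha"
            r += 1
        for j in range(1, m + 1):
            cells[(r, j)] = (i, j)
        r += 1
    return cells


def check_constraints(cells, m):
    """Return the truth values of constraints 3, 4 and 5."""
    pe = lambda c: c[1]            # column j is projected to PE_j
    time = lambda c: c[0] + c[1]   # assumed linear schedule
    alpha = [c for c, kind in cells.items() if kind == "alpha"]
    comp = [c for c, kind in cells.items() if kind != "alpha"]
    all_pes = set(range(1, m + 1))
    c3 = {pe(c) for c in alpha} == all_pes
    c4 = {pe(c) for c in comp} == all_pes
    busy = {(pe(c), time(c)) for c in comp}
    c5 = all(any((pe(a) + d, time(a)) in busy for d in (-1, 1)) for a in alpha)
    return c3, c4, c5


print(check_constraints(modified_graph(n=4, m=3, before_row=3), m=3))
```

For this instance all three checks succeed; the remaining constraints concern the edges of the graph and are not modeled here.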

In the second approach to modifying the dependency graph, majority voting is applied. In this scheme $k$ adjacent PEs will perform the same operation, the output will be the majority result, and error detection will be performed at the same time. We introduce $k-1$ special symbols $\alpha_{1}, \ldots, \alpha_{k-1}$, which play roles similar to the $\alpha$ symbol. For simplicity, we assume that $k$ is 3, but it is straightforward to extend $k$ to be any odd number. When $P E_{i}$ receives an $\alpha_{1}$ symbol, it performs the same action as before: it simulates a computation in the adjacent $P E$, say $P E_{i-1}$. If $P E_{i+1}$ receives an $\alpha_{2}$ symbol, it simulates the computation of a $P E$ which is at distance 2 from it, say $P E_{i-1}$. We need to guarantee that $P E_{i+1}$ receives $\alpha_{2}$ and $P E_{i}$ receives $\alpha_{1}$ at the same time, and at a time when they can both simulate the same computation by $P E_{i-1}$, do the error detection, and output the majority result.

Therefore, $\alpha_{2}$ should immediately precede $\alpha_{1}$ in the $\alpha$ stream. The constraints analogous to the $\Sigma$ constraints for performing majority voting are given below, with all terms previously used now indexed by the same index $i$ as the corresponding symbol $\alpha_{i}$. For example, $\vec{\alpha}_{i}$ is the normal vector for the $\alpha_{i}$ hyperplane.

$\Sigma_{maj\_k}$ constraints for hyperplanes:

1. all the $\vec{\alpha}_{i}$ are parallel to each other
2. the $\alpha_{k-1}, \ldots, \alpha_{1}$ symbols are in the same equitemporal hyperplane, and are projected to $k-1$ adjacent $P E$s
3. the $\alpha_{1}$-hyperplane satisfies the $\Sigma$ constraints

The corresponding more general constraints for the case of surfaces are:

$\Sigma_{maj\_k}$ constraints:

1. all the $\alpha_{i}$-surfaces are parallel to each other
2. the $\alpha_{k-1}, \ldots, \alpha_{1}$ symbols are in the same equitemporal surface, and are projected to $k-1$ adjacent $P E$s
3. the $\alpha_{1}$-surface satisfies the $\Sigma$ constraints

For example, the modified dependency graph in figure 3 satisfies the above $\Sigma_{m a j \_k}$ constraints. Note that if we want every computation cell in the dependency graph to be tested by $k$ $P E$s, we may need to add some extra PEs to take care of the cells on the border of the dependency graph.

Fig. 3. Modified dependency graph for the minimum substring-distance problem (approach 2).
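The voting step itself is simple. As a sketch only, the function below (the name `vote` is ours) takes the $k$ results produced for the same computation and returns the majority value together with a flag indicating whether any discrepancy was observed; it says nothing about how the results are gathered in hardware.

```python
from collections import Counter


def vote(results):
    """Return (majority value or None, error_detected) for k results."""
    value, count = Counter(results).most_common(1)[0]
    majority = value if count > len(results) // 2 else None
    return majority, count != len(results)


print(vote([7, 7, 7]))   # all PEs agree
print(vote([7, 9, 7]))   # one faulty result is outvoted and reported
print(vote([1, 2, 3]))   # no majority: detection without correction
```

With $k=3$, a single faulty result is both detected and corrected, while two or more disagreeing faulty results can only be detected.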

In the remainder of this paper we assume that ITRED uses the first approach (no majority voting), unless we explicitly state otherwise.

## 4. Fault Model

Given a dependency graph, we project it to a lower dimensional signal flow graph [8], and map this signal flow graph to a working architecture. Each cell of the signal flow graph that is mapped to the real working architecture is called an array node, which can usually be regarded as a $P E$. We use array and fault models similar to those in [2], [3], [7].

Each $P E$ is composed of two parts: the buffers and the processing unit ( $P U$ ). The buffers can be divided into two parts: the data buffers $(D B)$ and internal buffers (IB). $D B$ holds the input data and $I B$ holds the state necessary to perform the next operation.

In our first approach to run-time error detection, every two consecutive $P E s$ do the same operation and compare results. In the second approach, a majority vote determines the outcome if a discrepancy occurs. The comparator and majority voter can be implemented to be totally self-checkable [14], [13], and faults in buffers or communication can be detected and corrected by using coding techniques [14], [13]. The extra hardware for error detection in ITRED is so simple, and therefore can be built so reliably, that we can assume all faults occur in PEs.

A fault here will mean a functional fault, not the traditional gate-level stuck-at fault. In the first approach it is usually convenient to assume that when two adjacent PEs have their outputs compared, and they are both faulty, then their incorrect outputs are different, so that an error is detected immediately. Similarly, in the second approach, where we compare the outputs of $k$ adjacent $P E$ s operating on the same inputs, we assume that no $k$ adjacent faulty $P E$ s whose outputs are compared produce identical (incorrect) results.

## 5. One-Dimensional Linear Arrays

In this section we give details of the application of ITRED in the simplest case: one-dimensional linear arrays. Two-dimensional meshes and more complicated topologies are considered in the next section. As mentioned in Section 3, the constraints for introducing $\alpha$ symbols are more stringent for systolic arrays than for wavefront arrays, so we restrict attention to the former. We say a linear array is unilateral if data flows in only one direction (see figure 4 for an example). We say a linear array is bilateral if data can flow between two PEs in both directions. We begin with details for the first approach in the unilateral case, and discuss the second approach and the bilateral case subsequently.

Let $P E_{1}$ be the leftmost $P E$ and $P E_{i}$ the $i$th $P E$ from the left. For the case of a linear systolic array, this first approach yields a result similar to the one in [7], but no extra $P E$ is needed. When $P E_{i}$ receives an $\alpha$ symbol, it will do the same operation as $P E_{i-1}$ and compare both results. If the results are not the same, an error has been detected. If there are $c$ $\alpha$'s, then as long as there is at least one input data value between any two consecutive $\alpha$'s, $c$ different pairs of PEs can concurrently check their results. In figure 4, there are two $\alpha$'s and we show the sequence of pairs which do error detection at different clock times.
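The sequence of checking pairs can be enumerated with a short sketch. It assumes that the input stream advances by one $P E$ per clock tick, so that $P E_{i}$ sees stream position $t-(i-1)$ at tick $t$; `ALPHA` and `checking_pairs` are hypothetical names, and the stream used below is our own example rather than the one in figure 4.

```python
ALPHA = "<alpha>"  # hypothetical marker standing in for the alpha symbol


def checking_pairs(stream, num_pes):
    """Map each clock tick to the (PE_{i-1}, PE_i) pairs comparing results."""
    schedule = {}
    for tick in range(len(stream) + num_pes - 1):
        pairs = []
        for i in range(2, num_pes + 1):        # PE_1 has no left neighbor
            pos = tick - (i - 1)               # stream position seen by PE_i
            if 0 <= pos < len(stream) and stream[pos] == ALPHA:
                nxt = pos + 1                  # position seen by PE_{i-1}
                if nxt < len(stream) and stream[nxt] != ALPHA:
                    pairs.append((i - 1, i))
        if pairs:
            schedule[tick] = pairs
    return schedule


stream = ["a", ALPHA, "b", ALPHA, "c", "d"]
for tick, pairs in checking_pairs(stream, num_pes=4).items():
    print(tick, pairs)
```

With two $\alpha$'s separated by one data value, two pairs are checking at the same tick once both $\alpha$'s are inside the array.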

We next explain the details of the extra hardware required to implement ITRED. As mentioned above, the $P E$s are divided into a processing unit $P U$ and buffers, which in turn are divided into the data buffer $D B$ and the output buffer $I B$. The buffer $I B$ normally stores $P U$'s previous output. We index $P U, D B$ and $I B$ according to their corresponding $P E$.

Without loss of generality, we assume a three-phase clock. In the ordinary situation (without error detection), during the first phase (input phase) $P U_{i}$ loads data from $D B_{i-1}$ and some part of $I B_{i-1}$ into $D B_{i}$. During the second phase (processing phase) processing unit $P U_{i}$ gets input from $D B_{i}$ and $I B_{i}$, and performs its operation. During the third phase (output phase) $P U_{i}$ loads its result to $I B_{i}$ (again, assuming no error detection).


Fig. 4. An example of a unilateral linear array using ITRED.


Fig. 5. The PE cells.

In error-detection mode, when $P E_{i}$ receives an $\alpha$ symbol, it will do the same operation as $P E_{i-1}$ and compare results. The input phase is as before, passing along the $\alpha$ symbol in the input data stream. In the processing phase, $P U_{i}$ needs to get its input from $D B_{i-1}$ and $I B_{i-1}$. In the output phase, $P E_{i}$ will not load its results to $I B_{i}$, so as to preserve the old contents of $I B_{i}$ for further use. The only extra thing $P E_{i}$ needs to do in the output phase is to check its output with the output from $P E_{i-1}$. A block diagram for $P E_{i}$ and $P E_{i-1}$ is shown in figure 5.

Now we need to make sure that $P E_{i}$ will be in the correct state and get the correct input after an $\alpha$ symbol has passed through it. When $P E_{i}$ receives an $\alpha$ symbol it does not perform its real operation but performs the same operation that $P E_{i-1}$ does. At the next clock tick, say time $j$, since $I B_{i}$ did not change at time $j-1$, and data to $D B_{i}$ is also delayed one tick (because of the $\alpha$ symbol in the input stream), $P E_{i}$ can perform the same operation as it would have without the $\alpha$ symbol.

We give a simple example in figure 6. Assume the original input data is first $a$, and then $b, c$, and write $a_{i}$ to indicate the state of $P E_{i}$ after processing $a$. The succession of $P E$ states without error detection is shown at the top of figure 6. Next, consider what happens when the user inserts an $\alpha$ after $a$ to do error detection. We write $a_{i}^{*}$ to indicate that $P E_{i}$'s internal buffer has not changed, which happens when $P E_{i}$ receives an $\alpha$ symbol. The bottom of figure 6 shows the modified succession of events, and verifies the fact that each $P E$ receives the correct inputs and is in the correct states at the right times. From this example, we can see that the timing under a particular strategy may not be obviously correct. A general proof of correctness for ITRED will be given in Section 7.

Fig. 6. An example showing correct timing for a unilateral array.
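The timing argument can also be exercised in software. The following is a simplified behavioral sketch of a unilateral array for the minimum substring-distance problem, not a description of the hardware of figure 5. It assumes that each $P E$ keeps its last output and its last left input as internal state, that a data item travels together with the result computed for it by the previous $P E$, and that a fault adds 1 to every result a faulty $P E$ produces. All names (`ALPHA`, `cell`, `run_array`, `faulty_pe`) are hypothetical, and PEs are numbered from 0.

```python
ALPHA = "<alpha>"  # hypothetical marker standing in for the alpha symbol


def cell(p, s, up, left, diag):
    """One min operation of the dynamic program of Section 2."""
    return min(up + 1, left + 1, diag + (0 if s == p else 1))


def run_array(stream, pattern, faulty_pe=None):
    """Simulate the array tick by tick.

    Returns the values D[i, m] leaving the last PE, and a list of
    (tick, tested PE, checking PE) triples, one per detected mismatch.
    """
    m = len(pattern)
    up = [j + 1 for j in range(m)]   # last own output, initially D[0, j+1]
    diag = list(range(m))            # last left input, initially D[0, j]
    out = [None] * m                 # what each PE sent right on the last tick
    results, detections = [], []

    def execute(pe, col, item, up_old, diag_old):
        # PE `pe` evaluates the cell of column `col`; a faulty PE corrupts it.
        s, left = item
        val = cell(pattern[col], s, up_old[col], left, diag_old[col])
        return val + 1 if pe == faulty_pe else val

    for tick in range(len(stream) + m):
        fed = (stream[tick], 0) if tick < len(stream) else None  # D[i, 0] = 0
        inputs = [fed] + out[:-1]            # input phase
        up_old, diag_old = up[:], diag[:]    # state as seen during this tick
        for j, item in enumerate(inputs):    # processing and output phases
            if item is None or item[0] == ALPHA:
                out[j] = item                # idle, or pass alpha on; state kept
                continue
            val = execute(j, j, item, up_old, diag_old)
            up[j], diag[j] = val, item[1]
            out[j] = (item[0], val)
            if j == m - 1:
                results.append(val)
        for j in range(1, m):                # comparisons triggered by alpha
            mine, left_in = inputs[j], inputs[j - 1]
            if mine and mine[0] == ALPHA and left_in and left_in[0] != ALPHA:
                if execute(j, j - 1, left_in, up_old, diag_old) != out[j - 1][1]:
                    detections.append((tick, j - 1, j))
    return results, detections


plain, _ = run_array(list("abcabd"), "abd")
checked, found = run_array(["a", ALPHA, "b", "c", ALPHA, "a", "b", "d"], "abd")
print(plain == checked, found)
```

In this model, inserting $\alpha$ symbols delays the results but does not change them, and with `faulty_pe` set the comparisons report a mismatch.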

We next discuss the second approach, where the results of more than two computations are compared.

For the purpose of discussion, we assume that the parameter $k$ is 3, so there are two special symbols $\alpha_{1}$ and $\alpha_{2}$. Since $P E_{i+1}$ now needs to simulate the computation in $P E_{i-1}$, there needs to be a new data line from $P E_{i-1}$ to $P E_{i+1}$. We need also to include a majority voter in every $P E$, which entails only a simple modification of the hardware in figure 5. One correctly modified dependency graph for the above example is shown in figure 3, and a more condensed version of the same systolic array is shown in figure 7. In the next section, we will demonstrate the application of this approach to a two-dimensional systolic array for matrix multiplication.

Next we illustrate the application of ITRED to the case of bilateral linear arrays using the example of convolution. Given two sequences $x(j)$ and $y(j)$, $j=0, \ldots, n-1$, the convolution of $x$ and $y$ is

$$
z(i)=\sum_{j=0}^{n-1} x(j) y(i-j)
$$

where $i=0, \ldots, 2 n-2$. The dependency graph is shown in figure 8.
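For reference, the convolution can be evaluated directly as follows. This sketch assumes that $x$ and $y$ both have length $n$ and that terms whose index $i-j$ falls outside $0, \ldots, n-1$ are taken to be zero; the function name `convolve` is our own.

```python
def convolve(x, y):
    """Direct evaluation of z(i) for i = 0, ..., 2n-2."""
    n = len(x)
    if len(y) != n:
        raise ValueError("both sequences are assumed to have length n")
    z = [0] * (2 * n - 1)
    for i in range(2 * n - 1):
        for j in range(n):
            if 0 <= i - j < n:          # skip terms outside the sequences
                z[i] += x[j] * y[i - j]
    return z


print(convolve([1, 2, 3], [4, 5, 6]))
```

Each multiply-accumulate in the inner loop corresponds to one computation cell of the dependency graph.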

We first modify the dependency graph to add $\alpha$ symbols, taking care to satisfy the $\Sigma$ constraints. The vector $\vec{p}$ can be chosen to be $(1,1)$, which results in a bilateral linear array. Inserting rows of $\alpha$'s results in a unicutting $\alpha$-hyperplane, and the vector $\vec{\alpha}=(1,0)$. We then add delay edges that are parallel to $\vec{p}$, shown as bold edges in figure 9. Finally, we choose a schedule, which results in the signal-flow graph shown below the dependency graph. Note that this choice of schedule results in equitemporal surfaces that are not hyperplanes. It is now easy to verify the remaining $\Sigma$ constraints: for every $\alpha^{j}$, there exists a non-$\alpha$ computation cell that is in the same equitemporal surface as $\alpha^{j}$ and is projected to a neighboring $P E$ of $P E^{j}$. Figure 9 shows the final, correctly modified dependency graph.

Fig. 7. More condensed systolic array.

Fig. 8. The dependency graph for convolution.

In the original dependency graph every other $P E$ is idle at any given time, and the schedule can use these idle PEs to simulate their neighbors. Under this schedule, at most one extra clock period is needed after any number of $\alpha$ symbols are inserted. Although some vertical edges in figure 9 are in equitemporal surfaces, it is still a legal systolic scheduling, since these vertical edges point to $\alpha$ cells and not computation cells. The result in this simple example differs from that of [7] in the following respects: First, our method does not need an extra PE. Second, [7] assumes that a $P E$ becomes idle at every other cycle, and that every other $P E$ is idle at any given time. Our method, however, does not depend on this assumption, and still works when no idle PEs or idle cycles are available.

Fig. 9. Modified dependency graph for convolution.

The same general scheduling strategy works for the second approach when $k=3$. We use idle $P E$s for simulation, and the throughput is reduced by a factor of at most 2 instead of 3.

## 6. An Example of a Two-Dimensional Working Architecture

In this section, we illustrate how ITRED can be used to incorporate error detection in a two-dimensional systolic mesh for matrix multiplication. Given two $n$ by $n$ matrices $A$ and $B$, we want to compute $C=A B$. Thus,

$$
c_{i, j}=\sum_{k=1}^{n} a_{i, k} b_{k, j},
$$

where $1 \leq i, j \leq n$. Writing this as the single assignment statement

$$
c_{i, j, k}=c_{i, j, k-1}+a_{i, k} b_{k, j}
$$

leads to the three-dimensional dependency graph shown in figure 10, with axes $(i, j, k)$.

We choose the projection vector to be $\vec{p}=(0,0,1)$, and the $\alpha$-hyperplane to be the two-dimensional plane of the input data $A$, which means that $\vec{\alpha}=(0,0,1)$ (see figure 11). The vector $\vec{s}$ can be taken to be $(1,1,1)$. It is easy to verify that with these choices the $\Sigma$ constraints are satisfied, and the correctly modified dependency graph is shown in figure 11.
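As a sketch of what these choices mean, the code below evaluates the single assignment recurrence and maps each cell $(i, j, k)$ of the unmodified dependency graph to a $P E$ and a time step. It assumes that projection along $\vec{p}=(0,0,1)$ sends cell $(i, j, k)$ to $P E_{i, j}$ and that the schedule $\vec{s}=(1,1,1)$ assigns it time $i+j+k$; the function names are hypothetical, and the $\alpha$ cells are not modeled.

```python
def matmul_single_assignment(A, B):
    """Evaluate c[i][j][k] = c[i][j][k-1] + a[i][k] * b[k][j] for n-by-n inputs."""
    n = len(A)
    # c[i][j][k] is the partial sum after k terms, with c[i][j][0] = 0
    c = [[[0] * (n + 1) for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(1, n + 1):
                c[i][j][k] = c[i][j][k - 1] + A[i][k - 1] * B[k - 1][j]
    return [[c[i][j][n] for j in range(n)] for i in range(n)]


def place(i, j, k):
    """Map cell (i, j, k) to (PE index, time step) under the assumed mapping."""
    return (i, j), i + j + k


def conflict_free(n):
    """True if no PE is asked to execute two cells at the same time."""
    idx = range(1, n + 1)
    slots = [place(i, j, k) for i in idx for j in idx for k in idx]
    return len(slots) == len(set(slots))


print(matmul_single_assignment([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
print(place(2, 3, 1), conflict_free(3))
```

The conflict check reflects the requirement $\vec{s}^{T} \vec{p}>0$: cells projected to the same $P E$ are executed at distinct times.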

For the second approach, we can use a two-dimensional hexagonal array to implement majority voting for $k=3$. A modified dependency graph can be obtained from the graph in figure 11 by substituting $\alpha_{1}$ for $\alpha$ and adding an $\alpha_{2}$ hyperplane above the $\alpha_{1}$ hyperplane. When $PE_{i, j}$ receives $\alpha_{1}$, it simulates the computation in $PE_{i, j-1}$, and when $PE_{i+1, j}$ receives $\alpha_{2}$, it also simulates the computation in $PE_{i, j-1}$. The corresponding two-dimensional hexagonal array representing the working architecture is shown in figure 12.
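The voting step for $k=3$ can be sketched as follows. This is only an illustration of majority voting over three copies of one computation: where the vote is physically taken in the hexagonal array is not described here, and the names `majority_vote`, `cell_op`, and `faulty_cell_op` are hypothetical.

```python
from collections import Counter


def majority_vote(results):
    """Return (value, ok), where value is the strict-majority result, if any."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 > len(results):
        return value, True
    return None, False


def cell_op(c_prev, a, b):
    # The multiply-accumulate step of the matrix multiplication cell.
    return c_prev + a * b


def faulty_cell_op(c_prev, a, b):
    # Hypothetical fault model: the result is off by one.
    return cell_op(c_prev, a, b) + 1


own = faulty_cell_op(10, 2, 3)   # result of the (faulty) PE itself
sim1 = cell_op(10, 2, 3)         # simulation triggered by alpha_1
sim2 = cell_op(10, 2, 3)         # simulation triggered by alpha_2
corrected, ok = majority_vote([own, sim1, sim2])
```

With a single faulty copy, the two agreeing results outvote it, so the error is corrected rather than merely detected.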


Fig. 10. The dependency graph for matrix multiplication.


Fig. 11. The modified dependency graph for matrix multiplication (panel label: two-dimensional mesh at time 3).


Fig. 12. The hexagonal array for the second approach.

## 7. Proof of Correctness of ITRED

In this section we prove that ITRED results in a correct design if the $\Sigma$ constraints are satisfied. We begin with a lemma. We say that a dependency graph is feasible for ITRED if $\alpha$ symbols can be inserted at inputs, and each $P E$ receiving an $\alpha$ symbol will delay its own computation and simulate the computation of a neighboring $P E$.

Lemma 7.1. A dependency graph modified according to the $\Sigma$ constraints will be feasible for ITRED, and no extra PEs will be introduced.

Proof. Constraint 1 (there is an $\alpha$ cell on the border where data arrives) ensures that $\alpha$ symbols can be inserted in the input. Constraint 2 (delay edges are parallel to the projection vector) ensures that a PE will do its delayed computations later. Constraint 4 (every PE is the image of a computation cell) ensures that there are no extra PEs. Constraint 5 (there is a non-$\alpha$ computation cell in the same equitemporal surface as $\alpha^{j}$ that projects to a neighbor of $PE^{j}$) ensures that the neighbor of $PE^{j}$ does its normal computation at the same time that $PE^{j}$ simulates it.
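The behavior that feasibility describes can be sketched as one clock period of a PE. This is a simplified model under stated assumptions: the marker `ALPHA`, the function `pe_cycle`, and the direct comparison against the neighbor's result are all hypothetical, and the sketch does not model where the comparison hardware sits.

```python
ALPHA = object()  # hypothetical marker for the special test symbol


def pe_cycle(symbol, own_operands, neighbor_operands, neighbor_result, op):
    """One clock period of a PE.

    Returns (own_result, error_flag). own_result is None when the PE
    delays its own computation in order to simulate its neighbor.
    """
    if symbol is ALPHA:
        # Delay own work, redo the neighbor's computation, and compare.
        simulated = op(*neighbor_operands)
        return None, simulated != neighbor_result
    return op(*own_operands), False


def mac(c, a, b):
    return c + a * b


normal = pe_cycle("data", (0, 2, 3), (1, 4, 5), 21, mac)
test_ok = pe_cycle(ALPHA, (0, 2, 3), (1, 4, 5), 21, mac)
test_bad = pe_cycle(ALPHA, (0, 2, 3), (1, 4, 5), 22, mac)
```

In the third call the neighbor reports 22 while the simulation gives 21, so the error flag is raised.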

We can now prove our main result. Recall that a strategy for inserting $\alpha$'s is termed $\alpha$-successful if it results in all PEs being tested at least once, and with correct timing.

Theorem 7.2. A strategy for ITRED that obeys the $\Sigma$ constraints is $\alpha$-successful.

Proof. From lemma 7.1 we know that the modified dependency graph can be used by ITRED. Constraint 3 (every PE is the image under projection of an $\alpha$ cell) implies that every PE can be tested. It remains to be shown that the timing is correct.

An $\alpha$ cell represents a delay (or null operation) in the modified dependency graph. Recall that from constraint 6 (the $\alpha$-surface is unicutting) we know that all edges cross the $\alpha$-surface in the same direction.

Let $i_{1}, i_{2}, \ldots, i_{k}$ be incoming data for one computation of a normal computation cell in the original, unmodified dependency graph. Suppose for a contradiction that after the $\alpha$'s are inserted and the dependency graph is modified, one of the data items, say $i_{j}$, arrives earlier than the other data. Then it was not delayed by an $\alpha$ cell, which contradicts the condition that the $\alpha$ surface is a cutting surface. If it arrives later than the other data items, it crossed the $\alpha$ surface more than once, which contradicts the fact that the dependency graph is acyclic and the $\alpha$ surface is unicutting. Thus the required data items arrive together at the correct time, which finishes the proof.
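The "arrives earlier" case of this argument can be illustrated on a small grid. The sketch below is a toy model, not the construction used in the paper: it assumes a two-dimensional grid with unit-time edges to the right and downward, models an $\alpha$ cell as one extra clock period on an edge, and assumes that border inputs are supplied whenever they are needed. The function name `inputs_aligned` is hypothetical.

```python
def inputs_aligned(rows, cols, delayed_edges):
    """Check that all incoming data arrive together at every grid cell.

    Edges run (i, j) -> (i + 1, j) and (i, j) -> (i, j + 1). Each edge
    takes one clock period, plus one more if it is listed in
    delayed_edges (that is, if it passes through an alpha cell).
    """
    ready = {}
    for i in range(rows):
        for j in range(cols):
            arrivals = []
            for pred in ((i - 1, j), (i, j - 1)):
                if pred in ready:
                    delay = 1 if (pred, (i, j)) in delayed_edges else 0
                    arrivals.append(ready[pred] + 1 + delay)
            if len(set(arrivals)) > 1:
                return False  # one operand arrives before the other
            ready[(i, j)] = arrivals[0] if arrivals else 0
    return True


# A full row of delays between rows 1 and 2: every path crosses it once.
full_cut = {((1, j), (2, j)) for j in range(4)}
# Delays on only half of that row: some data is not delayed at all.
partial_cut = {((1, j), (2, j)) for j in range(2)}

aligned_full = inputs_aligned(4, 4, full_cut)
aligned_partial = inputs_aligned(4, 4, partial_cut)
```

The full row of delays shifts everything below it by one clock period and keeps the operands aligned, whereas the partial row lets one operand reach cell $(2, 2)$ a clock period before the other.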

The proof can be extended easily to the second approach.

## 8. Diagonal Projection with Modified ITRED

In this section we give an example where a certain choice of a projection vector $\vec{p}$ results in a signal flow graph for which it appears impossible to apply the ITRED method without introducing extra PEs. We then show how to modify the ITRED method to handle this case, and how to modify the $\Sigma$ constraints to reflect this modification. This example is meant to illustrate the flexibility of the approach, and suggest ideas for further applications.

The example is the minimum substring-distance problem discussed in Section 2. For simplicity, assume that strings $S$ and $P$ both have the same length $n$. Suppose now that given the dependency graph in figure 1, for some reason the designer chooses the projection vector $\vec{p}$ to be $(1,1)$, resulting in a diagonal projection. If now the $\alpha$ surface is chosen to be a row (column), $\alpha$ symbols will pass through only the right (left) half of the processors, violating constraint 3 and resulting in a design where not all the processors can be tested. It is clear that we must introduce $\alpha$'s into both rows and columns. Figure 13 shows such an $\alpha$ surface. This satisfies both constraints 3 and 4: every PE is the image under projection of both an $\alpha$ cell and a normal computation cell.
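The coverage problem can be seen by projecting cells onto PEs explicitly. The sketch below assumes an $n \times n$ grid of cells $(i, j)$ in which cells on the same diagonal map to the PE with index $j-i$, and it places the $\alpha$ cells along the first row and the first column. Both the indexing convention and the placement are assumptions made for illustration and may differ from figure 13.

```python
def projected_pes(cells):
    # Assumed PE index for projection along p = (1, 1): cells on the
    # same diagonal share the value j - i.
    return {j - i for (i, j) in cells}


n = 4
all_pes = projected_pes((i, j) for i in range(n) for j in range(n))

row = [(0, j) for j in range(n)]   # alpha cells along one row
col = [(i, 0) for i in range(n)]   # alpha cells along one column

row_only = projected_pes(row)
col_only = projected_pes(col)
both = projected_pes(row + col)
```

Under these assumptions a row alone reaches only the PEs with non-negative index and a column alone only those with non-positive index, while their union reaches all $2n-1$ PEs, as constraint 3 requires.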

But now we run into a problem because constraint 1 is violated: there is no $\alpha$ cell on the border at which input data enters. We can in effect generate $\alpha$ symbols from inside the dependency graph by modifying the ITRED method as follows. Each data value that needs to be transmitted between two PEs will be in one of two states: normal or $\alpha^{\prime}$. If the user wants to test the PEs, an input data point is inserted in the $\alpha^{\prime}$ state; otherwise, the input data is inserted in the normal state. Note that we do not insert special $\alpha$ symbols here. Whenever two data values that are both in the $\alpha^{\prime}$ state meet at $PE_{i}$, that PE changes the state of the data values to normal, simulates the same operation as one of its neighbors, and sends $\alpha$ symbols on in accordance with the modified dependency graph. That is, $PE_{i}$ behaves as if it had received an $\alpha$ symbol, and then generates $\alpha$ symbols for the other processors. In our example, two data values in the $\alpha^{\prime}$ state are inserted into row and column inputs, and meet in the middle PE. At the next clock interval, two $\alpha$ symbols are sent to the left and right neighboring PEs respectively (see figure 13).

Fig. 13. Modified dependency graph for a bilateral linear array.
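The state rule for the modified method can be written as a small decision function. This is a minimal sketch: the state tags and the function `pe_receive` are hypothetical, and the handling of the case where only one of the two values is in the $\alpha^{\prime}$ state (passed on unchanged here) is an assumption, since the text does not specify it.

```python
NORMAL, ALPHA_PRIME = "normal", "alpha_prime"  # hypothetical state tags


def pe_receive(state_a, state_b):
    """Decide what a PE does when two data values meet.

    Returns (test_now, out_state_a, out_state_b, emit_alpha).
    """
    if state_a == ALPHA_PRIME and state_b == ALPHA_PRIME:
        # Behave as if an alpha symbol had been received: simulate a
        # neighbor, reset both values to normal, and emit alpha symbols.
        return True, NORMAL, NORMAL, True
    # Assumption: otherwise compute normally and pass the states on.
    return False, state_a, state_b, False


meet = pe_receive(ALPHA_PRIME, ALPHA_PRIME)
single = pe_receive(ALPHA_PRIME, NORMAL)
plain = pe_receive(NORMAL, NORMAL)
```

Only the PE at which two $\alpha^{\prime}$ values meet starts a test and becomes a source of $\alpha$ symbols for the other processors.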

There is no decrease in throughput with this scheduling. Also, as before, although some vertical and horizontal edges are in an equitemporal surface, the schedule is still systolic because these edges point to $\alpha$ cells. This modified strategy does result in one disadvantage: the last computation cell in the first row and first column cannot be tested. All the other computation cells can be tested, however.

In our example, although there is no $\alpha$ cell on the border at which data arrives, the union of the row and column of $\alpha$ cells forms a unicutting surface in the dependency graph. Thus, if the PEs introduce a delay when they receive an $\alpha$ symbol, the timing correctness will be preserved. To take this new method into account, we should change the $\Sigma$ constraints by substituting the following for constraints 1 and 6:

$1^{\prime}$. The union of the $\alpha$ cells is a unicutting surface.

The proofs of lemma 7.1 and theorem 7.2 then go through with obvious changes for this more general version of ITRED.

## 9. Conclusions

We proposed a new methodology for run-time error detection in systolic and wavefront arrays. The method is based on modifying the dependency graph to allow special symbols to enter the computation. These special symbols cause error checking to take place. We developed a set of constraints, the $\Sigma$ constraints, and showed that they are sufficient to ensure that the timing is correct, that every $P E$ can be tested, and that no extra PEs are introduced. Since the design choices are made at the abstract level of the dependency graph, the approach is very general, and can be applied to a wide variety of arrays in any dimension.

## References

1. J.A. Abraham, et al., "Fault tolerance techniques for systolic arrays," IEEE Computer, 1987, pp. 65-74.
2. E.S. Manolakos and S.Y. Kung, "CORP-a new recovery procedure for VLSI processor arrays," IEEE Symp. on the Engin. of Computer Based Medical Systems, 1988.
3. E.S. Manolakos and S.Y. Kung, "Neighbor assisted recovery in VLSI processor arrays," European Signal Processing Symposium, EUSIPCO '88, North Holland, 1988.
4. C.-C. Wu and T.-S. Wu, "Concurrent error correction in unidirectional linear arithmetic arrays," Proc. Int. Symp. Fault-Tolerant Computing, 1987, pp. 136-141.
5. R. Cosentino, "Concurrent error correction in systolic architectures," IEEE Trans. on Computer-Aided Design, vol. 7, 1988, pp. 117-125.
6. L. Shombert and D.P. Siewiorek, "Using redundancy for concurrent testing and repairing of systolic arrays," Proc. Int. Symp. Fault-Tolerant Computing, 1987, pp. 246-249.
7. Y.H. Choi, S.M. Han, and M. Malek, "Fault diagnosis of reconfigurable systolic arrays," Proc. Int'l. Conf. Computer Design: VLSI in Computers, 1984, pp. 451-455.
8. S.Y. Kung, VLSI Array Processors, Englewood Cliffs, NJ: Prentice Hall, 1988.
9. H.-H. Liu and K.-S. Fu, "VLSI arrays for minimum-distance classifications," VLSI for Pattern Recognition and Image Processing, (King-Sun Fu, ed.), New York: Springer-Verlag, 1984.
10. R.J. Lipton and D. Lopresti, "A systolic array for rapid string comparison," 1985 Chapel Hill Conference on Very Large Scale Integration, (Henry Fuchs, ed.), Rockville, MD: Computer Science Press, 1985, pp. 363-376.
11. R.J. Lipton and D. Lopresti, "Comparing long strings on a short systolic array," 1986 International Workshop on Systolic Arrays, Oxford: University of Oxford, 1986.
12. G.M. Landau and U. Vishkin, "Introducing efficient parallelism into approximate string matching and a new serial algorithm," ACM STOC, 1986, pp. 220-230.
13. P.K. Lala, Fault Tolerance and Fault Testable Hardware Design, Englewood Cliffs, NJ: Prentice Hall, 1987.
14. J.F. Wakerly, Error Detecting Codes, Self-checking Circuits and Applications, New York: North Holland, 1978.


Edwin Hsing-Mean Sha received the B.S.E. degree in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 1986, and the M.A. degree and Ph.D. degree in computer science from Princeton University in 1990 and 1992. He is going to join the faculty of the Department of Computer Science and Engineering at the University of Notre Dame in the fall of 1992. His research interests include fault tolerant computing, testing, VLSI architectures, high-level synthesis in VLSI, and algorithms.


Kenneth Steiglitz received the B.E.E. (magna cum laude), M.E.E., and Eng.Sc.D. degrees from New York University, New York, NY, in 1959, 1960, and 1963, respectively.

Since September 1963 he has been at Princeton University, Princeton, NJ, where he is now Professor of Computer Science, teaching and conducting research on parallel architectures, signal processing, optimization algorithms, and cellular automata. He is the author of Introduction to Discrete Systems (New York: Wiley, 1974), and coauthor, with C.H. Papadimitriou, of Combinatorial Optimization: Algorithms and Complexity (Englewood Cliffs, NJ: Prentice Hall, 1982).

Dr. Steiglitz served two terms as a member of the IEEE Signal Processing Society's Administrative Committee, as chairman of their Technical Direction Committee, member of their VLSI Committee, their Digital Signal Processing Committee, and as their Awards Chairman. He is an Associate Editor of the journal Networks, and is a former Associate Editor of the Journal of the Association for Computing Machinery. A member of Eta Kappa Nu, Tau Beta Pi, and Sigma Xi, he was elected Fellow of the IEEE in 1981, received the Technical Achievement Award of the Signal Processing Society in 1981, their Society Award in 1986, and the IEEE Centennial Medal in 1984.


[^0]:    *This work was supported in part by NSF Grant MIP-8912100, and U.S. Army Research Office-Durham Grant DAAL03-89-K-0074.

