

Programming in VLSI:<br>From Communicating Processes<br>To Delay-Insensitive Circuits

Alain J. Martin

Department of Computer Science California Institute of Technology

Caltech-CS-TR-89-1


# PROGRAMMING IN VLSI: <br> FROM COMMUNICATING PROCESSES TO DELAY-INSENSITIVE CIRCUITS 

Alain J. Martin<br>to appear: UT Year of Programming Institute on Concurrent Programming C.A.R. Hoare, editor; Addison-Wesley, 1989

The research described in this report was sponsored by the Defense Advanced Research Projects Agency, ARPA Order Numbers 3771 and 6202; and monitored by the Office of Naval Research, under contract number N00014-87-K-0745

Department of Computer Science
California Institute of Technology
Pasadena, CA 91125
Caltech-CS-TR-89-1


Delays have dangerous ends.
-William Shakespeare

## Introduction

With chip size reaching one million transistors, the complexity of VLSI algorithms (i.e., algorithms implemented as digital VLSI circuits) is approaching that of software algorithms (i.e., algorithms implemented as code for a stored-program computer). Yet design methods for VLSI algorithms lag far behind the potential of the technology.

Since a digital circuit is the implementation of a concurrent algorithm, we propose a concurrent programming approach to digital VLSI design. The circuit to be designed is first implemented as a concurrent program that fulfills the logical specification of the circuit. The program is then compiled (manually or automatically) into a circuit by applying semantics-preserving program transformations. Hence, the circuit obtained is correct by construction.

The main obstacle to such a method is finding an interface that provides a good separation of the physical and algorithmic concerns. Among the physical parameters of the implementation, timing is the most difficult to isolate from the logical design, because the timing properties of a circuit are essential not only to its real-time behavior, but also to its logical correctness if the usual synchronous techniques are used to implement sequencing.

For this reason, delay-insensitive techniques are particularly attractive for VLSI synthesis. A circuit is delay-insensitive when its correct operation is independent of any assumption on delays in operators and wires except that the delays be finite [17]. Such circuits do not use a clock signal or knowledge about delays.

Let us clarify a matter of definitions right away: The class of entirely delay-insensitive circuits is very limited. Different asynchronous techniques distinguish themselves in the choice of the compromises about delay-insensitivity.

Speed-independent techniques assume that delays in gates are arbitrary, but that there are no delays in wires. Self-timed techniques assume that a circuit can be decomposed into equipotential regions inside which wire delays are negligible [16]. In our method, certain local "forks" are introduced to distribute a variable as inputs of several operators. We assume that the differences in delays between the branches of the fork are shorter than the delays in the operators to which the fork is an input. We call such forks isochronic [6].

Although we initially chose delay-insensitive techniques for reasons of methodology, those techniques present other important advantages in terms of efficiency and robustness:

The clock rate of a synchronous design has to be slowed to account for the worst-case clock skews in the circuit and for the slowest step in a sequence of actions. Since delay-insensitive circuits do not use clocks, they are potentially faster than their synchronous equivalents.

Since the logical correctness of the circuits is independent of the values of the physical parameters, delay-insensitive circuits are very robust to variations of these parameters caused by scaling or fabrication, or by some nondeterministic behavior such as the metastability of arbiters. For instance, all the chips we have designed have been found to be functional in a range of voltage values (for the constant voltage level encoding the high logical value) from above 10 V to below 1 V.

Delay-insensitive circuit design can be modular: A part of a circuit can be replaced by a logically equivalent one and safely incorporated into the design without changes of interfaces.

Because an operator of a delay-insensitive circuit is "fired" only when its firing contributes to the next step of the computation, the power consumption of such a circuit can be much lower than that of its synchronous equivalent.

Since the correctness of the circuits is independent of propagation delays in wires and, thus, of the length of the wires, the layout of chips is facilitated.

The method indeed produces correct and efficient circuits. It has been applied, with both "hand" compilation and automatic compilation, to a series of difficult design problems such as distributed mutual exclusion, fair arbitration, routing automata, stacks, and serial multipliers. All fabricated chips have been found to be correct on "first silicon". Although our CMOS implementation of the basic operators has been overly cautious, and the electrical optimization techniques have been rather tame, the performance of the chips has been found to be at least equal to that of synchronous implementations. We have just completed the design of a general-purpose microprocessor, and its performance is very encouraging: In $1.6\,\mu\mathrm{m}$ SCMOS, it runs at 18 million instructions per second. (See the conclusion, Section 23, for more detail.)

The main reason for the efficiency of the method is that, rather than going in one step from program to circuit, the designer applies a series of transformations to the original program. At each stage, powerful algebraic manipulations can be performed leading to important optimizations in terms of speed or area.

In the first part of this chapter, we present the "source code" notation, the "object code" notation, and a VLSI implementation of the production rules in CMOS technology. The source notation is inspired by C. A. R. Hoare's CSP [4]: A program is a set of concurrent processes communicating by input and output commands on channels. (A similar experience in the use of communicating processes for programming in VLSI is described in [13].) The object code notation, called production rule set, is one of the main innovations of the method and is an interesting notation for digital VLSI all by itself.

In the second part, we describe the four main steps of the compilation (process decomposition, handshaking expansion, production rule expansion, operator reduction), illustrating them with a number of examples. In particular, we present the different algebraic transformations that can be applied at different stages of the compilation and that give the method its flexibility and efficiency.

## Part I: The Source Code and the Object Code

## 1 The Program Notation

For the sequential part of the notation, we use a subset of Edsger W. Dijkstra's guarded command language [3], with a slightly different syntax. We give only an informal definition of the constructs' semantics.
(i) $b \uparrow$ stands for $b := \text{true}$, and $b \downarrow$ stands for $b := \text{false}$. These assignments are called "simple assignments".
(ii) The execution of the selection command $[G_{1} \rightarrow S_{1} \;\square\; \ldots \;\square\; G_{n} \rightarrow S_{n}]$, where $G_{1}$ through $G_{n}$ are boolean expressions, and $S_{1}$ through $S_{n}$ are program parts ($G_{i}$ is called a "guard", and $G_{i} \rightarrow S_{i}$ a "guarded command"), amounts to the execution of an arbitrary $S_{i}$ for which $G_{i}$ holds. If $\neg(G_{1} \vee \ldots \vee G_{n})$ holds, the execution of the command is suspended until $(G_{1} \vee \ldots \vee G_{n})$ holds.
(iii) The execution of the repetition command $*[G_{1} \rightarrow S_{1} \;\square\; \ldots \;\square\; G_{n} \rightarrow S_{n}]$, where $G_{1}$ through $G_{n}$ are boolean expressions, and $S_{1}$ through $S_{n}$ are program parts, amounts to repeatedly selecting an arbitrary $S_{i}$ for which $G_{i}$ holds and executing $S_{i}$. If $\neg(G_{1} \vee \ldots \vee G_{n})$ holds, the repetition terminates.
(iv) Sequencing: Besides the usual sequential composition operator "$x ; y$", we introduce two other operators. For atomic actions $x$ and $y$, "$x, y$" stands for the execution of $x$ and $y$ in any order leading to termination. For noninterfering communication actions $x$ and $y$, "$x \bullet y$" stands for the simultaneous execution of $x$ and $y$. (We shall return to this definition when we discuss the implementation of communication in Section 19.)
(v) $[G]$, where $G$ is a boolean expression, stands for $[G \rightarrow \text{skip}]$ and thus for "wait until $G$ holds". (Hence "$[G] ; S$" and $[G \rightarrow S]$ are equivalent.)
(vi) $*[S]$ stands for $*[\text{true} \rightarrow S]$ and thus for "repeat $S$ forever".
(vii) From (ii) and (iii), the operational description of the statement

$$
*[[G_{1} \rightarrow S_{1} \;\square\; \ldots \;\square\; G_{n} \rightarrow S_{n}]]
$$

is "repeat forever: wait until some $G_{i}$ holds; execute an $S_{i}$ for which $G_{i}$ holds".
(viii) Tail recursion is allowed, but not general recursion. Functions and procedures with a simple parameter mechanism are also used, but we will not discuss them here.
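
The selection and repetition commands of (ii) and (iii) can be mimicked in a few lines of Python. This is only a rough sketch: the names `select` and `repeat` are hypothetical, guards and commands are modeled as zero-argument callables, and because a sequential program cannot be "suspended", `select` reports that no guard holds instead of waiting. The integer example (Euclid's algorithm) is chosen only to exercise the control structure.

```python
import random

def select(guarded_commands, rng=random):
    """One execution of [G1 -> S1 [] ... [] Gn -> Sn].

    guarded_commands is a list of (guard, command) pairs of callables.
    Returns True if a command was executed and False if no guard holds
    (the point at which the real command would be suspended).
    """
    enabled = [cmd for guard, cmd in guarded_commands if guard()]
    if not enabled:
        return False
    rng.choice(enabled)()  # arbitrary choice among the enabled commands
    return True

def repeat(guarded_commands, rng=random):
    """*[G1 -> S1 [] ... [] Gn -> Sn]: terminates when no guard holds."""
    steps = 0
    while select(guarded_commands, rng):
        steps += 1
    return steps

# Illustration: *[a > b -> a := a - b [] b > a -> b := b - a]
state = {"a": 12, "b": 18}

def sub_a():
    state["a"] -= state["b"]

def sub_b():
    state["b"] -= state["a"]

steps = repeat([
    (lambda: state["a"] > state["b"], sub_a),
    (lambda: state["b"] > state["a"], sub_b),
])
```

In this run at most one guard holds at a time, so the arbitrary choice never matters; the repetition terminates as soon as both guards are false.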

### 1.1 Communicating Processes

A concurrent computation is described as a set of processes composed by the usual concurrent composition operator $\|$. The concurrent composition is weakly fair; i.e., if, in a given state of the computation, $x$ is the next atomic action of one of the processes, then $x$ will be executed after a possibly unbounded but finite number of atomic actions from other processes.

Processes communicate by communication actions on ports; they do not share variables. ${ }^{1}$ A port of a process is paired with a port of another process to form a channel. When no messages are transmitted, communication on a port is reduced to synchronization signals. The name of the port is then sufficient to identify a communication action.

If two processes, $p1$ and $p2$, share a channel with port $X$ in $p1$ and port $Y$ in $p2$, then at any time the number of completed $X$-actions in $p1$ equals the number of completed $Y$-actions in $p2$. In other words, the completion of the $n$th $X$-action "coincides" with the completion of the $n$th $Y$-action. If, for example, $p1$ reaches the $n$th $X$-action before $p2$ reaches the $n$th $Y$-action, the completion of $X$ is suspended until $p2$ reaches $Y$. The $X$-action is then said to be pending. When, thereafter, $p2$ reaches $Y$, both $X$ and $Y$ are completed. The predicate "$X$ is pending" is denoted as $\mathbf{q}X$. If, for an arbitrary command $A$, $\mathbf{c}A$ denotes the number of completed $A$-actions, the semantics of a pair $(X, Y)$ of communication commands is expressed by the two axioms:

$$
\begin{gather*}
\mathbf{c} X=\mathbf{c} Y  \tag{A1}\\
\neg \mathbf{q} X \vee \neg \mathbf{q} Y \tag{A2}
\end{gather*}
$$

Surprisingly, it is possible (and even advantageous) to define communication actions as coincident and yet implement the actions in completely asynchronous ways.

### 1.2 Probe

Instead of the usual selection mechanism by which a set of pending communication actions can be selected for execution, we provide a general boolean command on ports, called the probe. The definition of the probe given in [5] states that in process $p1$, the probe command $\bar{X}$ has the same value as $\mathbf{q}Y$. For the time being, we use a weaker definition, namely:

$$
\begin{aligned}
\bar{X} & \Rightarrow \mathbf{q} Y \\
\mathbf{q} Y & \Rightarrow \diamond \bar{X},
\end{aligned}
$$

where $\diamond P$ means $P$ holds eventually. (We will return to the first definition in the example on the implementation of a fair arbiter.)
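
To make axioms (A1) and (A2) and the probe concrete, here is a minimal Python model of a single channel. It is a sketch under stated assumptions, not the implementation discussed later in the chapter: the class `Channel` and its methods are hypothetical names, the two processes are not modeled at all (the caller decides who arrives first), and `probe` follows the stronger definition from [5], in which the probe of $X$ has exactly the value of $\mathbf{q}Y$.

```python
class Channel:
    """Toy model of one channel: port X in p1 paired with port Y in p2."""

    def __init__(self):
        self.completed = {"X": 0, "Y": 0}        # cX and cY
        self.pending = {"X": False, "Y": False}  # qX and qY

    @staticmethod
    def _other(port):
        return "Y" if port == "X" else "X"

    def reach(self, port):
        """The owning process arrives at its next action on `port`."""
        assert not self.pending[port], "a process cannot pass a pending action"
        other = self._other(port)
        if self.pending[other]:
            # The partner is already waiting: both actions complete together.
            self.pending[other] = False
            self.completed["X"] += 1
            self.completed["Y"] += 1
        else:
            self.pending[port] = True  # suspended until the partner arrives
        self._check_axioms()

    def probe(self, port):
        """Probe of `port`: true iff the matching action is pending."""
        return self.pending[self._other(port)]

    def _check_axioms(self):
        assert self.completed["X"] == self.completed["Y"]     # (A1)
        assert not (self.pending["X"] and self.pending["Y"])  # (A2)


ch = Channel()
ch.reach("Y")                 # p2 arrives first, so Y is pending
probe_before = ch.probe("X")  # p1 probes X and sees the pending Y
ch.reach("X")                 # p1 arrives: both actions complete
probe_after = ch.probe("X")
```

The two counters are only ever incremented together, which is why (A1) holds by construction in this model.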

### 1.3 Communication

Matching communication actions are also used to implement a form of distributed assignment statement, to "pass messages", as it is often said. In that case, the pair of commands is specified to consist of an input command and an output command by adjoining them to the symbols "?" and "!", respectively. For example, $X?$ is an input command and $X$ is therefore an input port, and $Y!$ is an output command and $Y$ is therefore an output port.

## Communication Axiom

Let $X?u$ and $Y!v$ be matching, where $u$ is a process variable and $v$ is an expression of the same type as $u$. The communication implements the assignment $u := v$. In other words, if $v=V$ before the communication, then $u=V$ and $v=V$ after the communication.

### 1.4 First Example: Port Selection

Process sel repeatedly performs communication action $X$ or communication action $Y$, whichever can be completed; sel is blocked if and only if neither $X$ nor $Y$ can be completed:

$$
\textit{sel} \equiv *[[\bar{X} \rightarrow X \;\square\; \bar{Y} \rightarrow Y]] .
$$

Obviously, process sel is not fair because of the nondeterministic choice of a guard when both guards are true. Negated probes make it possible to transform sel into a fair version, fsel:

$$
\begin{aligned}
\textit{fsel} \equiv {} & *[[\,\bar{X} \rightarrow X ;\ [\bar{Y} \rightarrow Y \;\square\; \neg \bar{Y} \rightarrow \text{skip}] \\
& \;\square\; \bar{Y} \rightarrow Y ;\ [\bar{X} \rightarrow X \;\square\; \neg \bar{X} \rightarrow \text{skip}] \\
& {]]} .
\end{aligned}
$$

Negated probes are necessary for implementing fairness.
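
The difference between sel and fsel can be seen in a small simulation. The sketch below makes several simplifying assumptions that are not in the text: one loop iteration is a single function call, the probe values are fixed for the duration of an iteration, and the nondeterministic choice is delegated to a hypothetical `choose` function so that an adversarial (always-first) choice can be tried.

```python
def enabled_actions(probe_x, probe_y):
    """Names of the communication actions whose probe is true."""
    return [name for name, probe in (("X", probe_x), ("Y", probe_y)) if probe]

def sel_step(probe_x, probe_y, choose):
    """One iteration of sel: perform one action whose probe holds."""
    enabled = enabled_actions(probe_x, probe_y)
    return [choose(enabled)] if enabled else []

def fsel_step(probe_x, probe_y, choose):
    """One iteration of fsel: after the chosen action, serve the other
    port if its probe holds; otherwise skip (the negated-probe guard)."""
    enabled = enabled_actions(probe_x, probe_y)
    if not enabled:
        return []
    first = choose(enabled)
    other = "Y" if first == "X" else "X"
    other_probe = probe_y if first == "X" else probe_x
    return [first, other] if other_probe else [first]

def always_first(enabled):
    """Adversarial choice: always pick the first enabled guard."""
    return enabled[0]

# Both partners are permanently ready; the choice always favors X.
sel_trace = [a for _ in range(5) for a in sel_step(True, True, always_first)]
fsel_trace = [a for _ in range(5) for a in fsel_step(True, True, always_first)]
```

Under this adversarial choice, `sel_trace` contains only X-actions, whereas `fsel_trace` alternates X and Y.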

### 1.5 Second Example: Lazy Stack

We implement a stack $S$ of size $n, n>0$, as a string of $n$ communicating processes defined as follows:

$$
S= \begin{cases}h, & \text { if } n=1 \\ (h \| T), & \text { if } n>1\end{cases}
$$

where $h$, the head of the stack, is a process, and $T$, the tail of the stack, is a stack of size $n-1$. Process $h$ communicates with the environment of the stack by the communication actions in?x and out!x, and with $T$ by the communication actions put!x and get?x. Hence, h.put matches T.in, and h.get matches T.out. (We assume that no attempt is ever made to add a portion to a full stack, or to remove a portion from an empty stack.)

Each stack element either is empty and behaves like program $E$, or is full and behaves like program $F$. The epithet "lazy" is attributed to this stack because no reshuffling of portions takes place after a portion has been removed from a full stack element.

$$
\begin{aligned}
E \equiv {} & [\,\overline{\textit{in}} \rightarrow \textit{in}?x ;\ F \\
& \;\square\; \overline{\textit{out}} \rightarrow \textit{get}?x ;\ \textit{out}!x ;\ E \\
& {]} \\
F \equiv {} & [\,\overline{\textit{out}} \rightarrow \textit{out}!x ;\ E \\
& \;\square\; \overline{\textit{in}} \rightarrow \textit{put}!x ;\ \textit{in}?x ;\ F \\
& {]} .
\end{aligned}
$$

The following alternative coding of the stack element process, due to Peter Hofstee, illustrates the advantages of the probe construct:

$$
\begin{aligned}
& *[[\overline{i n} \rightarrow \text { in } ? x \\
& \square \overline{o u t} \rightarrow \text { get } ? x \\
& ] ; \\
& {[\overline{o u t} \rightarrow \text { out }!x} \\
& \square \overline{i n} \rightarrow \text { put }!x \\
& ]] .
\end{aligned}
$$

We assume that each stack element is initially empty.
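
The behavior of programs $E$ and $F$ can be traced with a sequential Python model. This is a sketch only: the class `StackElement` and the helper `make_stack` are hypothetical names, each communication is collapsed into a method call (so the concurrency between elements is lost), and, as in the text, pushing onto a full stack or popping an empty one is assumed never to happen.

```python
class StackElement:
    """One stack element; `full` tells whether it behaves like F or like E."""

    def __init__(self, tail=None):
        self.tail = tail   # next element of the stack (None for the last one)
        self.full = False  # every element is initially empty
        self.x = None

    def push(self, value):
        """The environment communicates on port `in`."""
        if self.full:
            # F: probe(in) -> put!x; in?x; F
            self.tail.push(self.x)
        # E: probe(in) -> in?x; F
        self.x = value
        self.full = True

    def pop(self):
        """The environment communicates on port `out`."""
        if self.full:
            # F: probe(out) -> out!x; E
            self.full = False
            return self.x
        # E: probe(out) -> get?x; out!x; E  (the element stays empty)
        self.x = self.tail.pop()
        return self.x


def make_stack(n):
    """Build a stack of n elements, n > 0, and return its head."""
    stack = None
    for _ in range(n):
        stack = StackElement(tail=stack)
    return stack


s = make_stack(3)
for v in (1, 2, 3):
    s.push(v)
popped = [s.pop()]               # head becomes empty; nothing is reshuffled
s.push(4)                        # the empty head takes the new portion directly
popped += [s.pop(), s.pop(), s.pop()]
```

After the first pop the head stays empty and its tail is left untouched, which is the "lazy" behavior described above.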

## 2 The Object Code: Production Rules

Carrying the discrete model of computation down to the transistor level requires that the MOS transistor be idealized as an on/off switch. Unfortunately, the simple semantics of the switch ignore too many electrical phenomena that play an important role in the functioning of the circuit. A crucial innovation of the method is that the transistor need not be viewed as a discrete switch; voltages can change continuously from one stable level to the other one, provided that the changes are monotonic.

The notation for the object code provides the weakest possible form of control structure and the smallest possible number of program constructs. In fact, it contains exactly one construct, the production rule (PR), and one control structure, the production-rule set.

We consider the production-rule notation to be the canonical representation of a digital circuit. This representation can be decomposed into several equivalent networks of digital operators, depending on the set of building blocks used, but the production-rule set represents the circuit independently of the chosen implementation.

Definition A PR is a construct of the form $G \mapsto S$, where $S$ is either a simple assignment or an unordered list "$s1, s2, s3, \ldots$" of simple assignments, and $G$ is a boolean expression called the guard of the PR.

## Example

$$
\begin{aligned}
x \wedge y & \mapsto z \uparrow \\
\neg x & \mapsto u \uparrow, v \downarrow
\end{aligned}
$$

The semantics of a PR are defined only if the PR is stable:

Definition A PR $G \mapsto S$ is said to be stable in a given computation if, at any point of the computation, $G$ either is false or remains invariantly true until the completion of $S$.

Stability is not guaranteed by the implementation. It has to be enforced by the compilation procedure.

Definition An execution of the stable PR $G \mapsto S$ is an unbounded sequence of firings. A firing of $G \mapsto S$ with $G$ true amounts to the execution of $S$. A firing of $G \mapsto S$ with $G$ false amounts to a skip.

Definition A PR set is the concurrent composition of all PRs of the set.

### 2.1 Operations on PR Sets

The only composition operation on two PR sets is the set union.

## Theorem

The implementation of two concurrent processes is the set union of the two PR sets implementing the processes and of the PR sets implementing the channels between the processes, if any.

The proof follows from the associativity of the concurrent composition operator.

The other operations on the PRs of a set are those allowed by the following properties:

Multiple occurrences of the same PR are equivalent to one as a consequence of the idempotence of the concurrent composition.

The two rules $G \mapsto S1$ and $G \mapsto S2$ are equivalent to the single rule $G \mapsto S1, S2$.

The two rules $G1 \mapsto S$ and $G2 \mapsto S$ are equivalent to the single rule $G1 \vee G2 \mapsto S$.

### 2.2 Noninterference

We require that complementary PRs (i.e., PRs of the type $G1 \mapsto x \uparrow$ and $G2 \mapsto x \downarrow$) be noninterfering.

Definition Two complementary PRs are noninterfering when $\neg G1 \vee \neg G2$ holds invariantly.

It can be proven that, under the stability of each PR and noninterference among complementary PRs, the concurrent execution of the PRs of a set is equivalent to the following sequential execution:

$$
*[\text{select a PR with a true guard; fire the PR}]
$$

where the selection is weakly fair (each PR is selected infinitely often). From now on, we ignore the firings of a PR with a false guard; a firing will mean a firing of a PR with a true guard.

Until we return to these issues, we shall assume that the stability and noninterference requirements are fulfilled.
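
The sequential execution model above is easy to prototype. The following Python sketch is an illustration under assumptions of our own: the function `run_prs` is a hypothetical name, a PR is a pair of a guard (a function of the state) and a dictionary of simple assignments, the assignments of one PR are applied in a single step, and the loop stops when no firing would change the state, which a real PR set never does. The two rules are the ones given in the example of this section.

```python
import random

def run_prs(rules, state, rng, max_firings=1000):
    """*[select a PR with a true guard; fire the PR], restricted to firings
    that change the state and stopped once no such firing remains."""
    for _ in range(max_firings):
        effective = [
            assigns for guard, assigns in rules
            if guard(state)
            and any(state[var] != val for var, val in assigns.items())
        ]
        if not effective:
            return state
        state.update(rng.choice(effective))  # arbitrary selection, then firing
    raise RuntimeError("no quiescent state within max_firings")

# x /\ y -> z up        and        not x -> u up, v down
rules = [
    (lambda s: s["x"] and s["y"], {"z": True}),
    (lambda s: not s["x"], {"u": True, "v": False}),
]

state = {"x": True, "y": True, "z": False, "u": False, "v": True}
run_prs(rules, state, random.Random(0))
after_first = dict(state)      # only the first rule could fire

state["x"] = False             # the environment lowers x
run_prs(rules, state, random.Random(0))
after_second = dict(state)     # now only the second rule can fire
```

Note that `z` keeps its value in the second run: no rule in this set ever resets it.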

## 3 VLSI Implementation of PRs

Stability and noninterference are the two properties that make the VLSI implementation of PRs (almost) straightforward. As an example, we describe how PRs can be implemented in CMOS technology.

### 3.1 The CMOS Transistors

A CMOS circuit is a network of "nodes" (variables) interconnected by transistors. Certain nodes are also connected to the input-output "pads", which provide the interface with the environment; we will ignore the pads in this presentation. Other nodes are directly connected to the power node, called VDD, which provides the constant high-voltage value that represents the logical constant true or 1. Yet other nodes are directly connected to the ground node, called GND, which provides the constant low-voltage value that represents the logical constant false or 0.

A node takes a continuous range of voltage values between the high voltage and the low voltage. Above a certain voltage $v1$, the value is interpreted as 1. Below another voltage $v0$, the value is interpreted as 0. Thanks to the stability property, the precise values of $v1$ and $v0$, which vary from node to node, are irrelevant provided that $v0<v1$ and the voltage changes are monotonic. (Strict monotonicity is not necessary and is actually impossible to achieve because of noise, but we will not enter into these details here.)

A CMOS transistor is of either $n$-type or $p$-type. A transistor relates three nodes in the following way. Let $g$, standing for "gate", and $x$ and $y$ be the three nodes. When $g$ is false for an $n$-transistor, and true for a $p$-transistor, no current passes through the region between $x$ and $y$, called the channel; ${ }^{2}$ thus $x$ and $y$ are left unchanged.

When $g$ is set to true for an $n$-transistor, or false for a $p$-transistor, the channel becomes conducting. In this case, either $x$ and $y$ have the same voltages and are left unchanged, or a current is established in the channel until $x$ and $y$ reach the same voltage. The common value reached by $x$ and $y$ depends on electrical properties of $x$ and $y$ that are determined by the physical sizes (capacitances) of the nodes implementing $x$ and $y$ and by their interactions with the rest of the circuit. (Differences in node capacitances may cause charges to flow through the channel of a transistor in a way that results in unintended values of the nodes. This phenomenon, called charge sharing, may make it quite difficult to predict the final voltage value reached by $x$ and y.)

In order to define the net effect of a PR independently of the physical parameters of its implementation, we are going to restrict the use of transistors. (In particular, the restriction will eliminate most occurrences of charge sharing.)

We impose the condition that a transistor used in isolation connect only two variables of the circuit: the gate $g$ and one of the other two nodes, say $z$.

The third node of the transistor is either the power or the ground. With this restriction, the behavior of a single $n$-transistor is

$$
g \mapsto z \uparrow \text { or } g \mapsto z \downarrow
$$

The behavior of a single $p$-transistor is

$$
\neg g \mapsto z \uparrow \text { or } \neg g \mapsto z \downarrow
$$

### 3.2 Threshold Voltages

The current in the channel of a transistor is a function of the so-called gate-to-source voltage, $V_{gs}$, defined as $V(g)-\min(V(x), V(y))$ for an $n$-transistor and as $V(g)-\max(V(x), V(y))$ for a $p$-transistor. To a first approximation, the current is assumed to be zero when

$$
V_{g s} \leq V_{t n}
$$

for an $n$-transistor and

$$
V_{g s} \geq V_{t p}
$$

for a $p$-transistor. $V_{tn}$ and $V_{tp}$ are called the threshold voltages. (Typically, $V_{tn} \approx 1\,\mathrm{V}$ and $V_{tp} \approx -1\,\mathrm{V}$.)

Because of the existence of threshold voltages, if an $n$-transistor is used to implement $g \mapsto z \uparrow$, the final value of $z$ is not a "strong" 1 , since the channel will stop conducting as soon as the voltage of $z$ is within $V_{t n}$ of the gate voltage. And symmetrically, a $p$-transistor used to implement $\neg g \mapsto z \downarrow$ does not produce a "strong" zero as the final value of $z$. Since the voltage drops caused by the threshold voltages accumulate as we compose operators, it is important to produce strong signals in order to be able to compose an arbitrary number of operators. We shall therefore restrict our use of $n$-transistors to PRs of the form

$$
\begin{equation*}
g \mapsto z \downarrow \tag{1}
\end{equation*}
$$

and $p$-transistors to production rules of the form

$$
\begin{equation*}
\neg g \mapsto z \uparrow \tag{2}
\end{equation*}
$$

With these restrictions, all implementations produce strong signals.

Threshold voltages are difficult to adjust in CMOS technology. Actually, they tend to become more variable as the feature size decreases. (They may also vary during the activity of the circuit because of some electrical interaction with the substrate, called body effect.) For constant node capacitance, variations in thresholds account for most of the discrepancies in propagation delays on a CMOS chip. In particular, these variations exclude the possibility that the ordering in space of a set of variables along a common wire be used to infer an ordering in time of a set of transitions of these variables.
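
The accumulation of threshold drops that motivates restrictions (1) and (2) can be illustrated with a few lines of arithmetic. The sketch below uses the first-approximation model of this section only (no body effect), and its numbers are assumptions chosen for illustration: a 5 V supply and $V_{tn} = 1\,\mathrm{V}$. The function `weak_high` is a hypothetical helper that models a chain of $n$-transistors, each used to pull its output up and each gated by the output of the previous one.

```python
def weak_high(v_gate, v_tn, stages):
    """Final 'high' voltage after `stages` cascaded n-transistor pull-ups.

    Each n-transistor stops conducting once its output is within v_tn of
    its gate voltage, so every stage loses one threshold voltage.
    """
    v = v_gate
    for _ in range(stages):
        v = max(v - v_tn, 0.0)  # the output cannot be pulled below ground
    return v

# Assumed values: 5 V on the first gate, threshold voltage of 1 V.
levels = [weak_high(5.0, 1.0, k) for k in range(4)]
```

With these assumed values the high level degrades by one volt per stage, which is why an arbitrary number of operators can be composed only if each one restores a strong signal.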

### 3.3 Switching Circuits

Consider the canonical (stable) PR

$$
\begin{equation*}
b \mapsto z \downarrow \tag{3}
\end{equation*}
$$

where $b$ is a boolean expression in terms of a set of variables. These variables are used as gates of transistors implementing a switching circuit $s$ corresponding to $b$: $s$ is a series-parallel switching circuit between the ground node and $z$. The switches are $n$-transistors whose gates are the variables of $b$, possibly negated. Furthermore, we have

$$
b \equiv \text { "there is a path from ground to } z \text { in } s "
$$

By the construction of $s$, if $b$ holds and remains stable, $z$ is eventually set to 0 . (For this reason, $s$ is called a pull-down circuit.) Hence, $s$ is exactly the implementation of production rule (3).

Using a symmetrical argument, we can show that the same series-parallel circuit as $s$, but connected between the power node and $z$, and whose switches are $p$-transistors, implements the production rule

$$
\begin{equation*}
\textit{bneg} \mapsto z \uparrow \tag{4}
\end{equation*}
$$

where $\textit{bneg}$ is derived from $b$ by negating all variables. (This circuit is called a pull-up circuit.)
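
The correspondence between a guard and its series-parallel circuit can be checked mechanically. In the sketch below, which is an illustration and not part of the method, a network is a nested tuple: `("t", name)` is a single transistor gated by variable `name`, and `("series", ...)` and `("parallel", ...)` combine subnetworks. The function `conducts` and the example guard $a \wedge (c \vee d)$ are hypothetical choices; negated gate variables are left out for brevity.

```python
from itertools import product

def conducts(net, gates, n_type=True):
    """True if the series-parallel network `net` has a conducting path.

    An n-transistor conducts when its gate is true, a p-transistor when
    its gate is false.
    """
    kind = net[0]
    if kind == "t":
        g = gates[net[1]]
        return g if n_type else not g
    parts = [conducts(sub, gates, n_type) for sub in net[1:]]
    return all(parts) if kind == "series" else any(parts)

# Network for b = a /\ (c \/ d): a in series with (c parallel d).
net = ("series", ("t", "a"), ("parallel", ("t", "c"), ("t", "d")))

table = {}
for a, c, d in product([False, True], repeat=3):
    gates = {"a": a, "c": c, "d": d}
    table[(a, c, d)] = (
        conducts(net, gates, n_type=True),   # pull-down built from n-transistors
        conducts(net, gates, n_type=False),  # same network with p-transistors
    )
```

For every assignment, the first component equals $b$ and the second equals $\textit{bneg} = \neg a \wedge (\neg c \vee \neg d)$, which in general is not the same as $\neg b$.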

## 4 Operators

Two PRs that set and reset the same variable, such as

$$
\begin{array}{lll}
b1 & \mapsto & z \uparrow  \tag{5}\\
b2 & \mapsto & z \downarrow,
\end{array}
$$

are implemented as one operator.

Let $s1$ be the pull-up circuit corresponding to $b1$, and let $s2$ be the pull-down circuit corresponding to $b2$. The two circuits are connected through the common node $z$ (see Figure 1). Since noninterference has been enforced, $\neg b1 \vee \neg b2$ holds at any time. This guarantees the absence of a conducting path between power and ground when the operator is not firing. (A path may exist for a short time when the operator is firing.)

Definition The operator implementing the two rules is called "combinational" if $b 1 \vee b 2$ holds at any time, and "state-holding" otherwise.

By definition, if (5) is combinational, there is always a conducting path between either VDD or GND and the output $z$. Hence, the value of the output is always a strong 0 or a strong 1 , and therefore $s 1$ and $s 2$ are together a valid implementation of (5).

For example, PRs (1) and (2) together implement an inverter as represented in Figure 2. The circuit of Figure 3 implements the nand-operator defined by the PRs

$$
\begin{array}{rll}
a \wedge b & \mapsto & z \downarrow \\
\neg a \vee \neg b & \mapsto & z \uparrow .
\end{array}
$$
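
The distinction between combinational and state-holding operators, together with the noninterference requirement, can be tested by brute force over the input values. The sketch below is conservative by assumption: it enumerates all input combinations, whereas the definitions only require the conditions to hold in the states that actually occur. The function `classify` is a hypothetical helper, and the second and third pairs of guards are illustrative examples of our own.

```python
from itertools import product

def classify(b1, b2, names):
    """Classify the operator  b1 -> z up,  b2 -> z down  over all inputs."""
    states = [dict(zip(names, vals))
              for vals in product([False, True], repeat=len(names))]
    if any(b1(s) and b2(s) for s in states):
        return "interfering"       # not b1 or not b2 is violated somewhere
    if all(b1(s) or b2(s) for s in states):
        return "combinational"     # b1 or b2 always holds
    return "state-holding"         # some state leaves z isolated

# The nand-operator: a /\ b -> z down, not a \/ not b -> z up.
nand = classify(lambda s: not s["a"] or not s["b"],
                lambda s: s["a"] and s["b"],
                ["a", "b"])

# Illustrative pair whose guards are both false when a differs from b.
holding = classify(lambda s: s["a"] and s["b"],
                   lambda s: not s["a"] and not s["b"],
                   ["a", "b"])

# Illustrative pair whose guards are both true when a and b are both true.
bad = classify(lambda s: s["a"], lambda s: s["b"], ["a", "b"])
```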

If (5) is a state-holding operator, $\neg b1 \wedge \neg b2$ may hold in a certain state. In such a state, node $z$ is isolated; there is no path between $z$ and either VDD or GND. In MOS technology, an isolated node does not retain its value forever; eventually the charges leak away through the substrate and also through the transistors of the pull-up and pull-down circuits. If the PRs of the operator are fired frequently enough to prevent leakage, the implementation of Figure 1 can be used for a state-holding operator. Such an implementation is called dynamic.

Figure 1. CMOS implementation of a combinational operator.


Otherwise, it is necessary to add a storage element to the output node of a state-holding operator. Such an implementation is called static. In the sequel, we assume that only static implementations are used for state-holding operators.

(A standard CMOS implementation of such a storage element consists of two cross-coupled inverters (see Figure 4). This implementation inverts the value of $z$. The "weak" inverter, marked with a letter $w$ on the figure, connects $z$ to either VDD or GND through a high resistance, so as to maintain $z$ at its intended voltage value [18].)

The implementation of a static state-holding operator is slightly more costly than that of a combinational operator because of the need for a storage device. Hence, given a pair of PRs that are not combinational, we may first try to modify the guards, under the invariance of the semantics, so as to make them combinational.

## 5 The Standard Operators

All operators of one or two inputs are used, and are therefore viewed as the standard operators.

Figure 2. A CMOS inverter.


### 5.1 One-Input Operators

The two operators with one input and one output are the wire:

$$
\begin{aligned}
x \,\underline{w}\, y \equiv \quad x & \mapsto y \uparrow \\
\neg x & \mapsto y \downarrow,
\end{aligned}
$$

and the inverter:

$$
\begin{aligned}
\neg x \,\underline{w}\, y \equiv \quad \neg x & \mapsto y \uparrow \\
x & \mapsto y \downarrow .
\end{aligned}
$$

Most operators we use have more inputs than outputs. In general, however, the components we design have as many outputs as inputs. Hence, we need to reset the balance by introducing at least one operator, the fork, with more outputs than inputs. A fork with two outputs is defined as

$$
\begin{aligned}
x \,\underline{f}\,(y, z) \equiv \quad x & \mapsto y \uparrow, z \uparrow \\
\neg x & \mapsto y \downarrow, z \downarrow .
\end{aligned}
$$

The wire and the fork are the only two operators that are implemented not as a pull-up/pull-down circuit (called a restoring circuit) but as a simple conducting interconnection between input and outputs.

Figure 3. CMOS implementation of a nand-gate.


### 5.2 The Wire as a Renaming Operator

Because the implementation of a wire is the same as that of a node, the wire behaves as a renaming operator when composed with another operator: The composition of an arbitrary operator $O$ with output variable $x$ with the wire $x \underline{w} y$ is equivalent to $O$ in which $x$ is renamed $y$. The composition of operator $O$ with input variable $x$ with the wire $y \underline{w} x$ is equivalent to $O$ in which $x$ is renamed $y$. (Observe that $O$ can even be a wire.)

Unfortunately, the fork is not a renaming operator since the concurrent assignments to the different outputs of the fork are not completed simultaneously. In order to use a fork as a renaming operator, we will later have to make the timing assumption that such a fork is isochronic.

### 5.3 Combinational Operators with Two Inputs

We construct all functions $B$ of two variables $x$ and $y$ such that

$$
\begin{array}{rll}
B & \mapsto & z \uparrow \\
\neg B & \mapsto & z \downarrow .
\end{array}
$$

We get for $B$: $x \wedge y$, $x \vee y$, and $x=y$. We will not list the functions obtained by inverting inputs of $B$. (In the figures, a negated input or output is represented by a small circle on the corresponding line.) This gives the following set:

Figure 4. A static implementation of a state-holding operator.


The and, with the infix notation $(x, y) \wedge z$, is defined as

$$
\begin{array}{rll}
x \wedge y & \mapsto & z \uparrow \\
\neg x \vee \neg y & \mapsto & z \downarrow .
\end{array}
$$

The or, with the infix notation $(x, y) \vee z$, is defined as

$$
\begin{array}{rlr}
x \vee y & \mapsto & z \uparrow \\
\neg x \wedge \neg y & \mapsto & z \downarrow .
\end{array}
$$

The equality, with the infix notation $(x, y)$ eq $z$, is defined as

$$
\begin{array}{lll}
x=y & \mapsto & z \uparrow \\
x \neq y & \mapsto & z \downarrow
\end{array}
$$
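The three definitions above can be tabulated mechanically. The following Python fragment is only a minimal sketch and not part of the original method; the names `OPERATORS` and `is_combinational` are hypothetical. Each operator is modeled as a pair of guards, and a pair is combinational exactly when the two guards are complementary for every input valuation.

```python
from itertools import product

BOOLS = (False, True)

# Each operator is a pair (up, down): `up` guards z↑ and `down` guards z↓.
OPERATORS = {
    "and": (lambda x, y: x and y, lambda x, y: (not x) or (not y)),
    "or": (lambda x, y: x or y, lambda x, y: (not x) and (not y)),
    "eq": (lambda x, y: x == y, lambda x, y: x != y),
}

def is_combinational(up, down):
    """True when `down` is the negation of `up` for all four input valuations."""
    return all(up(x, y) != down(x, y) for x, y in product(BOOLS, repeat=2))

for name, (up, down) in OPERATORS.items():
    print(name, is_combinational(up, down))
```

Running the loop reports that all three pairs are combinational, which is the property used to select them.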

### 5.4 State-Holding Operators with Two Inputs

Next, we construct all different two-input-one-output operators of the form

$$
\begin{array}{lll}
b 1 & \mapsto & z \uparrow \\
b 2 & \mapsto & z \downarrow
\end{array}
$$

such that $\neg b 1 \vee \neg b 2$ holds at any time, but $b 1 \neq \neg b 2$. We select for $b 1$ either $x \wedge y$, or $x \vee y$, or $x=y$. For each choice of $b 1$, we construct $b 2$ as any of the effective strengthenings of $\neg b 1$.

For $b 1 \equiv(x \wedge y)$, we get for $b 2: \neg x \wedge \neg y, \neg x \wedge y, \neg x$, and $x \neq y$. The first three choices of $b 2$ lead to the following state-holding operators:

The $C$-element:

$$
\begin{aligned}
(x, y) \underline{C} z \equiv \quad x \wedge y & \mapsto z \uparrow \\
\neg x \wedge \neg y & \mapsto z \downarrow .
\end{aligned}
$$

(The C-element, introduced by David Muller, is described in [15].)

The switch:

$$
\begin{aligned}
(x, y) \underline{s w} z \equiv \quad x \wedge y & \mapsto z \uparrow \\
\neg x \wedge y & \mapsto z \downarrow .
\end{aligned}
$$

The asymmetric C-element:

$$
\begin{aligned}
(x, y) \underline{a C} z \equiv \quad x \wedge y & \mapsto z \uparrow \\
\neg x & \mapsto z \downarrow .
\end{aligned}
$$
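To make the notion of a storage state concrete, the three operators can be simulated with the first three choices of $b 2$ listed above. This is a sketch under stated assumptions: the helper `next_z` and the constant names are hypothetical, and a single call models one evaluation of the PR pair on Boolean values, ignoring all timing.

```python
def next_z(up, down, x, y, z):
    """One evaluation of a PR pair: fire z↑, fire z↓, or keep the stored z."""
    if up(x, y):
        return True
    if down(x, y):
        return False
    return z  # storage state: neither guard holds

# Guard pairs (b1, b2) taken from the enumeration in the text.
C_ELEMENT = (lambda x, y: x and y, lambda x, y: (not x) and (not y))
SWITCH = (lambda x, y: x and y, lambda x, y: (not x) and y)
ASYM_C = (lambda x, y: x and y, lambda x, y: not x)

z = False
trace = []
for x, y in [(True, False), (True, True), (False, True), (False, False)]:
    z = next_z(*C_ELEMENT, x, y, z)
    trace.append(z)
print(trace)
```

For the C-element, the output changes only when both inputs agree; in the two states where $x \neq y$ the previous value of $z$ is kept.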

For $b 2=(x \neq y)$, we get the operator

$$
\begin{array}{lll}
x \wedge y & \mapsto & z \uparrow \\
x \neq y & \mapsto & z \downarrow .
\end{array}
$$

If the stability condition is fulfilled, however, this operator is not state-holding. Because of the stability requirement, the state in which $\neg x \wedge \neg y$ holds (the "storage state") can be reached only from states $x \wedge \neg y$ and $\neg x \wedge y$. In both states, $\neg z$ holds, and, therefore, $\neg z$ holds in the storage state. Hence, we can weaken the guard of the second PR as $(x \neq y) \vee(\neg x \wedge \neg y)$, i.e., $\neg x \vee \neg y$. Hence, the operator is equivalent to the and-operator $(x, y) \wedge z$.

For $b 1 \equiv(x \vee y)$, no effective strengthening of $\neg b 1$ is possible.

For $b 1 \equiv(x=y)$, we get the operator:

$$
\begin{array}{rll}
x=y & \mapsto & z \uparrow \\
x \wedge \neg y & \mapsto & z \downarrow
\end{array}
$$

If the stability condition is fulfilled, however, this operator is not state-holding for the same reasons that the operator with $b 1 \equiv x \wedge y$ and $b 2 \equiv(x \neq y)$ is not.
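The argument used for both operators can be checked by enumeration. The sketch below is illustrative only and rests on two assumptions that we state explicitly: inputs change one at a time, and, by stability, the transition enabled in the predecessor state has completed before the storage state is entered. The function name `stored_values` is hypothetical.

```python
from itertools import product

BOOLS = (False, True)

def stored_values(up, down):
    """Collect the values z can have on entering a storage state.

    Assumes single-input changes and completed transitions (stability).
    """
    values = set()
    for x, y in product(BOOLS, repeat=2):
        if up(x, y) or down(x, y):
            continue  # one guard holds, so this is not a storage state
        for px, py in ((not x, y), (x, not y)):  # single-input predecessors
            if up(px, py):
                values.add(True)
            elif down(px, py):
                values.add(False)
    return values

and_like = (lambda x, y: x and y, lambda x, y: x != y)
eq_like = (lambda x, y: x == y, lambda x, y: x and not y)
c_element = (lambda x, y: x and y, lambda x, y: (not x) and (not y))

print(stored_values(*and_like))
print(stored_values(*eq_like))
print(stored_values(*c_element))
```

When the set contains a single value, the storage state carries no information and the corresponding guard can be weakened to cover it; when it contains both values, as for the C-element, the operator is genuinely state-holding.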

### 5.5 Flip-Flop

The canonical form we choose for the flip-flop is

$$
\begin{aligned}
(x, y) \underline{f f} z \equiv \quad x & \mapsto z \uparrow \\
\neg y & \mapsto z \downarrow,
\end{aligned}
$$

which requires the invariance of $\neg x \vee y$ to satisfy noninterference. Observe that the flip-flop $(x, y) \underline{f f} z$ can always be replaced with the $C$-element $(x, y) \underline{C} z$, but not vice versa.
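The replacement of a flip-flop by a C-element can be verified by comparing the guards in every state permitted by the invariant. The following is a minimal sketch; `same_guards` is a hypothetical helper, and the comparison is purely Boolean.

```python
from itertools import product

BOOLS = (False, True)

FLIP_FLOP = (lambda x, y: x, lambda x, y: not y)
C_ELEMENT = (lambda x, y: x and y, lambda x, y: (not x) and (not y))

def same_guards(op_a, op_b, invariant):
    """True when both operators have equal guards in every state allowed by `invariant`."""
    return all(
        op_a[0](x, y) == op_b[0](x, y) and op_a[1](x, y) == op_b[1](x, y)
        for x, y in product(BOOLS, repeat=2)
        if invariant(x, y)
    )

# Under the invariant (not x) or y, the two operators are indistinguishable.
print(same_guards(FLIP_FLOP, C_ELEMENT, lambda x, y: (not x) or y))
# Without the invariant they differ, e.g. in the state x and not y.
print(same_guards(FLIP_FLOP, C_ELEMENT, lambda x, y: True))
```

The second call fails because a C-element may legitimately be used in the state $x \wedge \neg y$, where the two flip-flop guards would interfere.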

## 6 Multi-Input Operators

Since there are already 164 different operators with three inputs and one output, we shall not pursue the systematic enumeration that we started with two-input operators. We use $n$-input and, or, C-element, whose definitions are straightforward.

We use a multi-input flip-flop defined as

$$
\begin{aligned}
\left(x_{1}, \ldots, x_{k}, y_{1}, \ldots, y_{l}\right) \underline{m f f} z \equiv \quad \left(\exists i: x_{i}\right) & \mapsto z \uparrow \\
\left(\exists i: \neg y_{i}\right) & \mapsto z \downarrow,
\end{aligned}
$$

where $\left(\forall i: \neg x_{i}\right) \vee\left(\forall i: y_{i}\right)$ must hold at any time.

We also use the combinational if-operator, sometimes called multiplexer, defined as

$$
\begin{aligned}
(x, y, z) \underline{i f} u \equiv \quad (x \wedge y) \vee(\neg x \wedge z) & \mapsto u \uparrow \\
(x \wedge \neg y) \vee(\neg x \wedge \neg z) & \mapsto u \downarrow .
\end{aligned}
$$

The most general and most often used operator is the generalized C-element, of which all other forms of C-elements are a special case. It implements a pair of PRs

$$
\begin{array}{lll}
B 1 & \mapsto & x \uparrow \\
B 2 & \mapsto & x \downarrow
\end{array}
$$

in which $B 1$ and $B 2$ are arbitrary conjunctions of elementary terms. (As usual, the two guards have to be mutually exclusive.) For example,

$$
\begin{array}{rll}
a \wedge b \wedge \neg c & \mapsto & x \uparrow \\
\neg a \wedge d & \mapsto & x \downarrow
\end{array}
$$

can be directly implemented with a generalized C-element. Observe that the limiting factor for the size of the guards is not the number of inputs, but the number of terms in a conjunction.

## 7 Arbiter and Synchronizer

So far, we have considered only PR sets in which all guards are stable and noninterfering. But we shall have to implement sets of guarded commands -selections or repetitions- in which the guards are not mutually exclusive, as in the probe-selection example. Therefore, we need at least one operator that provides a nondeterministic choice between two true guards.

### 7.1 Arbiter

The simplest selection between nonexclusive guards is of the form

$$
\begin{aligned}
& *[[x \rightarrow \cdots \\
& \quad \mid y \rightarrow \cdots \\
& ]],
\end{aligned}
$$

where $x$ and $y$ are simple boolean variables, and the two guards are stable. In order to distinguish among the three basic states of the system-i.e., neither $x$ nor $y$ is selected, $x$ is selected, or $y$ is selected- we must introduce two outputs, say $u$ and $v$, as follows:

$$
\begin{aligned}
& *[[x \rightarrow u \uparrow ; \cdots \\
& \quad \mid y \rightarrow v \uparrow ; \cdots \\
& ]].
\end{aligned}
$$

Initially, $\neg u \wedge \neg v$ holds as coding of the state "no selection made". Hence, when the selection is considered completed, which is just a matter of definition, $u$ and $v$ should be set back to false. We get

$$
\begin{align*}
& *[[x \rightarrow u \uparrow ;[\neg x] ; u \downarrow \\
& \quad \mid y \rightarrow v \uparrow ;[\neg y] ; v \downarrow \tag{6}\\
& ]].
\end{align*}
$$

If $\neg u \wedge \neg v$ holds initially, $\neg u \vee \neg v$ holds at any time.
The preceding program is a description of the operator known as the "basic arbiter" or "mutual-exclusion element," denoted as $(x, y)$ arb ( $u, v$ ). Observe that the choice between the two guards is not fair.
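The behavior specified by (6) can be illustrated with a small Python model. This is a sketch, not an implementation of the circuit: the function `arbiter_round` is hypothetical, and the uniform random choice between two true guards is an assumption made purely for illustration, since the specification guarantees neither fairness nor any particular distribution.

```python
import random

def arbiter_round(x, y, rng):
    """Return (u, v) after one selection of program (6); at most one is true."""
    candidates = []
    if x:
        candidates.append("u")
    if y:
        candidates.append("v")
    if not candidates:
        return (False, False)  # both guards false: the arbiter waits
    winner = rng.choice(candidates)  # nondeterministic choice, modeled as random
    return (winner == "u", winner == "v")

rng = random.Random(0)
for _ in range(5):
    print(arbiter_round(True, True, rng))
```

Whatever choices are made, the invariant $\neg u \vee \neg v$ holds after every round.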

### 7.2 Synchronizer

When negated probes are used, for instance to implement fairness, we have to implement selection commands with unstable guards. The synchronizer is the only operator that accepts nonstable guards. It is defined as

$$
\begin{align*}
& *[[b \wedge z \rightarrow u \uparrow ;[\neg z] ; u \downarrow \\
& \quad \mid \neg b \wedge z \rightarrow v \uparrow ;[\neg z] ; v \downarrow \tag{7}\\
& ]].
\end{align*}
$$

Variable $b$ may change at any time from false to true, but both $b$ and $z$ remain true until $u$ or $v$ has changed. Hence, the guard $\neg b \wedge z$ is unstable, whereas the guard $b \wedge z$ is stable. As in the arbiter case, if $\neg u \wedge \neg v$ holds initially, $\neg u \vee \neg v$ holds at any time. (The synchronizer operator was introduced in [7].)

### 7.3 Implementation and Metastability

The PR sets for (6) and (7) necessarily contain unstable rules. The PR set for the "unstable arbiter" is

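$$
\begin{array}{rll}
x \wedge \neg v & \mapsto & u \uparrow \\
y \wedge \neg u & \mapsto & v \uparrow \\
\neg x \vee v & \mapsto & u \downarrow \\
\neg y \vee u & \mapsto & v \downarrow .
\end{array}
$$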

The PR set for the "unstable synchronizer" is

$$
\begin{array}{rlll}
b \wedge z \wedge \neg v & \mapsto & u \uparrow \\
\neg b \wedge z \wedge \neg u & \mapsto & v \uparrow \\
\neg z \vee v & \mapsto & u \downarrow \\
\neg z \vee u & \mapsto & v \downarrow .
\end{array}
$$

The first two PRs of the arbiter are unstable and can fire concurrently. The same holds for the first two production rules of the synchronizer: Since $b$ can change from false to true at any time, both guards may evaluate to true.

Let us analyze the PR set implementation of the arbiter. The synchronizer case is very similar. The state $x \wedge y \wedge(u=v)$ of the arbiter is called metastable. When started in the metastable state, with $\neg u \wedge \neg v$, the set of PRs specifying the arbiter may produce the following unbounded sequence of firings:

$$
*[(u \uparrow, v \uparrow) ;(u \downarrow, v \downarrow)] .
$$

In the implementation, nodes $u$ and $v$ may stabilize to a common intermediate voltage value for an unbounded period of time. Eventually, the inherent asymmetry of the physical realization (impurities, fabrication flaws, thermal noise, etc.) will force the system into one of the two stable states where $u \neq v$. But there is no upper bound on the time the metastable state will last, which means that it is impossible to include an arbitration device into a clocked system with absolute certainty that a timing failure cannot occur.

The spurious values of $u$ and $v$ produced during the metastable state must be eliminated since they violate the requirement $\neg u \vee \neg v$. Hence, we compose
the "bare" arbiter with a "filter" taking $u$ and $v$ as input and producing $u f$ and $v f$ as "filtered outputs". The net effect of the filter is

$$
u f, v f:=(u \wedge \neg v),(v \wedge \neg u) .
$$

(In the CMOS construction of the filter shown in Figure 5, we use the threshold voltages to our advantage: The channel of transistor $t 1$ is conducting only when ( $u \wedge \neg v$ ) holds, and the channel of transistor $t 2$ is conducting only when ( $v \wedge \neg u$ ) holds.)
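At the Boolean level, the net effect of the filter is easy to state in code. The sketch below is an abstraction only: `filter_outputs` is a hypothetical name, and the intermediate voltages of the real metastable state are represented here simply by the two Boolean states in which $u=v$.

```python
from itertools import product

def filter_outputs(u, v):
    """Net effect of the filter: uf, vf := (u and not v), (v and not u)."""
    return (u and not v, v and not u)

for u, v in product((False, True), repeat=2):
    uf, vf = filter_outputs(u, v)
    print((u, v), "->", (uf, vf))
```

In both states with $u=v$ the filtered outputs are false, so the requirement $\neg uf \vee \neg vf$ holds for every combination of $u$ and $v$.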

In delay-insensitive design, the correct functioning of a circuit containing an arbiter or a synchronizer is independent of the duration of the metastable state; therefore, relatively simple implementations of arbiters and synchronizers can be used. In synchronous design, however, the implementations have to meet the additional constraint that the probability of the metastable state lasting longer than the clock period should be negligible.

Figure 5. An implementation of the basic arbiter.


## 8 Sequencing and Stability

In the second part of this chapter, we shall see how an arbitrary program in the source notation can be decomposed -by a transformation called handshaking expansion-into a collection of sequences of the type

$$
S \equiv *\left[\left[w_{0}\right] ; t_{0} ;\left[w_{1}\right] ; t_{1} ; \ldots ;\left[w_{n-1}\right] ; t_{n-1}\right]
$$

The $w_{i}$, the wait-conditions, are boolean expressions, possibly identical to true, and the $t_{i}$ are simple assignments. The extension to the case of multiple assignments between the wait-conditions is straightforward.

The next step of the compilation procedure, the production-rule expansion (also to be explained in the second part), is the transformation of $S$ into a semantically equivalent set of production rules. Let

$$
P \equiv\left\{b_{i} \mapsto t_{i} \mid 0 \leq i<n\right\}
$$

be such a set.

Notations and Definitions For an arbitrary PR $p$, $p.g$ and $p.a$ denote the guard and the assignment of $p$, respectively. The predicate $R(a)$, the result of the simple assignment $a$, is defined as: $R(x \uparrow)=x$, and $R(x \downarrow)=\neg x$. An execution of a PR that changes the value of the assigned variable is called effective; otherwise, it is called vacuous.

With these definitions, the stability of a PR can be reformulated as follows:

Stability A PR $p$ is stable in a computation if and only if $p.g$ can be falsified only in states where $R(p.a)$ holds.

The production-rule expansion algorithm compiles a handshaking expansion $S$ into a set $P$ of PRs, all of which are stable except those whose guards contain negated probes. Since, as we shall see, the guards of the PRs are obtained by strengthening the wait-conditions of $S$, the stability of the wait-conditions is necessary to satisfy the stability of the PRs.

A wait-condition $w$ is stable if once $w$ is true, it remains true at least until the completion of the following assignment. Unstable wait-conditions can be caused by negated probes only. These cases are dealt with separately by introducing synchronizers. (An example of how this is achieved is given in Section 22.)

### 8.1 Sequencing

The set $P$ of PRs implements $S$ when the following conditions are fulfilled:

1. Guard strengthening: The guards of the PRs of $P$ are obtained by strengthening the wait conditions of $S: \forall i:: b_{i} \Rightarrow w_{i}$ and, in the initial state, $w_{0} \Rightarrow b_{0}$.
2. Sequential execution: $\left(\mathbf{N} i:: b_{i} \wedge \neg R\left(t_{i}\right)\right) \leq 1$, i.e., at most one effective PR can be executed at a time.
3. Program-order execution: The order of execution of effective PRs of $P$ is the order specified by $S$, called the program order, and no deadlock is introduced in the construction of $P$.

As we shall see in Part 2, it is not always possible to construct, for a given handshaking expansion, a PR set that satisfies the preceding three conditions. In certain cases, the handshaking expansion must be augmented with assignments to new variables, called state variables. This transformation, which is always possible, will be explained in Part 2.

### 8.2 Acknowledgment

Fulfilling the second and third conditions requires that for any two PRs $p$ : $b \mapsto t$ and $p^{\prime}: b^{\prime} \mapsto t^{\prime}$, such that $p$ immediately precedes $p^{\prime}$ in the program order,

$$
b^{\prime} \Rightarrow R(t)
$$

holds in the states where $p^{\prime}$ is effectively executed. We say that $b^{\prime}$ is the acknowledgment of $t$. Hence the following property:

Acknowledgment Property For a PR set executed in program order, the guard of each PR is an acknowledgment of the immediately preceding assignment.

We shall see that the acknowledgment property is necessary but not sufficient to ensure program-order execution.

We use two kinds of acknowledgments, depending on the type of variable used in the assignment. But other forms of acknowledgments can be envisioned. If $t$ assigns an internal variable, then the acknowledgment is implemented by strengthening $b^{\prime}$ as $b^{\prime} \wedge R(t)$.

For example, if $t$ is $x \uparrow$, the acknowledgment is $b^{\prime} \wedge x$.

If $t$ assigns an external variable, i.e., a variable that implements a communication command, another kind of acknowledgment, which we shall introduce later, can be used. For instance, if $lo$ is an output variable used together with input variable $li$ to implement a so-called active handshaking protocol, a possible acknowledgment of $lo \uparrow$ is $li$, since $li \Rightarrow lo$ at this point of the protocol.

### 8.3 Implementation of Stability

Consider a PR set $P$, which implements a given program $S$. We are going to show that the acknowledgment property, which is necessary to construct a $P$ that implements $S$, is also sufficient to guarantee stability.

The execution of a PR $p$ of $P$ establishes a path between a constant node (either VDD or GND), and the node implementing the variable (say, $x$) assigned by $p$. Either $p.g$ holds forever after $p$, or the firing of another PR $I$, the invalidating PR of $p$, will establish $\neg p.g$, thereby cutting the path from the constant node to $x$.

Let $\tilde{p}$ be the complementary PR of $p$, i.e., the PR with the complementary assignment. If the PR set contains both $p$ and $\tilde{p}$, then it also contains $I$ because of the noninterference requirement between complementary PRs. And we have the order of execution:

$$
p \preceq I \prec \tilde{p} .
$$

In all the states between $I$ and $\tilde{p}$, the original path to $x$ is cut. In that case, we have to see to it that the assignment to $x$ is completed before the path is cut. Hence the following requirement:

Completion requirement Assignment $p.a$ is completed when a PR $q$ is completed whose guard is an acknowledgment of $p.a$. The execution order of the PR set must satisfy

$$
p \prec q \preceq I .
$$

Since this requirement is already implied by the acknowledgment property, the construction of $P$ automatically guarantees stability.

### 8.4 Self-Invalidating PRs

Definition A PR $p$ is self-invalidating when $R(p.a) \Rightarrow \neg p.g$.

For example, $\neg x \mapsto x \uparrow$ is self-invalidating.

Self-invalidating PRs are excluded by the completion requirement since it implies $I \neq p$.

For instance, the circuit consisting of an inverter with its output connected to its input is excluded by the completion requirement since it corresponds
to the PR set:

$$
\begin{array}{rrrr}
\neg x & \mapsto & x \uparrow \\
x & \mapsto & x \downarrow
\end{array}
$$

and the two PRs of the set are self-invalidating. However, the PR set

$$
\begin{array}{rlll}
\neg x & \mapsto & y \uparrow \\
y & \mapsto & x \uparrow \\
x & \mapsto & y \downarrow \\
\neg y & \mapsto & x \downarrow
\end{array}
$$

fulfills the completion requirement, although it is the same circuit as previously, since the only change is the addition of the wire $y \underline{w} x$.

We eliminate such "disguised" self-invalidating PRs by adding the following requirement:

Restoring Acknowledgment Requirement There is at least one restoring PR $r$ satisfying $p \prec r \preceq I$, where $r$ is restoring if it is not part of a wire or a fork.

With this extra requirement, all forms of self-invalidating PRs are eliminated.
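The definition of a self-invalidating PR is a simple implication, so it can be tested exhaustively on the two example PR sets of this section. The fragment below is a sketch with hypothetical names (`is_self_invalidating`, `inverter_loop`, `disguised`); each PR is written as a triple of guard, assigned variable, and assigned value. It shows why the syntactic definition alone does not catch the "disguised" case, which is what motivates the restoring acknowledgment requirement.

```python
from itertools import product

def is_self_invalidating(guard, var, value, variables):
    """Check R(p.a) => not p.g over all valuations of `variables`."""
    for bits in product((False, True), repeat=len(variables)):
        state = dict(zip(variables, bits))
        if state[var] == value and guard(state):
            return False  # a state where R(p.a) and p.g both hold
    return True

inverter_loop = [
    (lambda s: not s["x"], "x", True),   # ¬x ↦ x↑
    (lambda s: s["x"], "x", False),      # x ↦ x↓
]
disguised = [
    (lambda s: not s["x"], "y", True),   # ¬x ↦ y↑
    (lambda s: s["y"], "x", True),       # y ↦ x↑
    (lambda s: s["x"], "y", False),      # x ↦ y↓
    (lambda s: not s["y"], "x", False),  # ¬y ↦ x↓
]

print([is_self_invalidating(g, v, val, ["x"]) for g, v, val in inverter_loop])
print([is_self_invalidating(g, v, val, ["x", "y"]) for g, v, val in disguised])
```

Both PRs of the inverter loop satisfy the definition, whereas none of the four PRs of the second set does, although the circuit is the same.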

It is remarkable that the acknowledgment requirement, which is necessary to enforce the sequential execution of a PR set, is also sufficient to satisfy stability. From now on, we can manipulate PRs as if the transitions were discrete. We have, however, made no simplifying assumption on the physical behavior of the system. The only physical requirement so far is that of monotonicity.

Another requirement on the implementation is that the rings of operators that constitute a circuit keep oscillating. It turns out that eliminating self-invalidating PRs enforces the condition that a ring contain at least three restoring operators, which is a necessary (and in practice also sufficient) condition for the ring to oscillate, thanks to the "gain" property of restoring gates. (See [14] for an explanation of gain.)

## Part II: The Compilation Method

In this part, we describe how a program in the source notation is transformed into a semantically equivalent set of VLSI operators. Four major trans-
an intermediate program representation, between communicating processes and PRs, that allows for important algebraic manipulations of the program: reshuffling, process factorization, and process quotient. We illustrate the method with a series of examples that covers practically all cases.

## 9 Process Decomposition

The first step of the compilation, called process decomposition, consists in replacing one process with several processes by application of the following rule:

Decomposition Rule A process $P$ containing an arbitrary program part $S$ is semantically equivalent to two processes, $P 1$ and $P 2$, where $P 1$ is derived from $P$ by replacing $S$ with a communication action, $C$, on a newly introduced channel $(C, D)$ between $P 1$ and $P 2$, and $P 2$ is the process $*[[\bar{D} \rightarrow S ; D]]$.

The structure of $P 2$ will be used so frequently that we introduce an operator to denote it: the call operator. We denote it by $(D / S)$, and we say that $D$ calls (or activates) $S$.

Observe that process decomposition does not introduce concurrency. Although P1 and P2 are potentially concurrent, they are never active concurrently; $P 2$ is activated from $P 1$, much as a procedure or a coroutine would be. The newly created subprocesses may share variables, but, since the subprocesses are never active concurrently, there is no conflicting access to the shared variables. The subprocesses may also share channels; this will require a special implementation for such channels. Decomposition is applied for each construct of the language. For construct $S$, the corresponding process $P 2$ can be simplified as follows:

If $S$ is the selection $\left[B_{1} \rightarrow S_{1} \; \square \; B_{2} \rightarrow S_{2}\right]$, $P 2$ is simplified as

$$
\begin{align*}
& *[[\bar{D} \wedge B_{1} \rightarrow S_{1} ; D \\
& \quad \square \; \bar{D} \wedge B_{2} \rightarrow S_{2} ; D \tag{8}\\
& ]].
\end{align*}
$$

If $S$ is the repetition $*\left[B_{1} \rightarrow S_{1} \; \square \; B_{2} \rightarrow S_{2}\right]$, $P 2$ is simplified as

$$
\begin{align*}
& *[[\bar{D} \wedge B_{1} \rightarrow S_{1} \\
& \quad \square \; \bar{D} \wedge B_{2} \rightarrow S_{2} \\
& \quad \square \; \bar{D} \wedge \neg B_{1} \wedge \neg B_{2} \rightarrow D \tag{9}\\
& ]].
\end{align*}
$$

The assignment $x:=B$, where $B$ is an arbitrary boolean expression, is implemented as the selection $[B \rightarrow x \uparrow \; \square \; \neg B \rightarrow x \downarrow]$, which gives for $P 2$

$$
\begin{aligned}
& *[[\bar{D} \wedge B \rightarrow x \uparrow ; D \\
& \quad \square \; \bar{D} \wedge \neg B \rightarrow x \downarrow ; D \\
& ]].
\end{aligned}
$$

The generalizations to the cases of an arbitrary number of guarded commands in selection and repetition are obvious. All assignments to the same variable are also grouped in the same process. Process decomposition is applied repeatedly until the right-hand side of each guarded command is a straight-line program.

Process decomposition makes it possible to reduce a process with an arbitrary control structure to a set of subprocesses of only two different types: either a (finite or infinite) sequence of communication actions, or a repetition of type (8) or (9).

## 10 Handshaking Expansion

The next step of the transformation, the handshaking expansion, replaces each communication action in a program with its implementation in terms of elementary actions, and each channel with a pair of wire operators. We shall first ignore the issue of message transmission and implement only the synchronization property of communication primitives.

Channel $(X, Y)$ is implemented by the two wires $(x o \underline{w} y i)$ and $(y o \underline{w} x i)$. If $X$ belongs to process $P 1$ and $Y$ to process $P 2$, then $x o$ and $x i$ belong to $P 1$, and $y o$ and $y i$ to $P 2$. Initially, $x o$, $x i$, $y o$, and $y i$, which we will call the "handshaking variables of $(X, Y)$," are false. Assume that the program has been proven to be deadlock-free and that we can identify a pair of matching actions $X$ and $Y$ in $P 1$ and $P 2$, respectively. We replace $X$ and $Y$ by the sequences $U_{x}$ and $U_{y}$, respectively, where

$$
\begin{align*}
& U_{x} \equiv x o \uparrow ;[x i]  \tag{10}\\
& U_{y} \equiv[y i] ; y o \uparrow .
\end{align*}
$$

Also,

$$
\begin{align*}
x o & \mapsto y i \uparrow \\
\neg x o & \mapsto y i \downarrow  \tag{11}\\
y o & \mapsto x i \uparrow \\
\neg y o & \mapsto x i \downarrow
\end{align*}
$$

by definition of the wires. By (10) and (11), any concurrent execution of $P 1$ and $P 2$ contains the following sequence of assignments:

$$
x o \uparrow ; y i \uparrow ; y o \uparrow ; x i \uparrow .
$$

### 10.1 Simultaneous Completion of Nonatomic Actions

We introduce a definition of completion of a nonatomic action which makes it possible to use the notion of simultaneous completion of two nonatomic actions.

By definition, the execution of an atomic action is considered instantaneous, and thus the simultaneous completion of two atomic actions does not make sense. (Atomic actions are simple assignments $x \uparrow$ and $x \downarrow$, and evaluation of simple guards, i.e., guards containing one variable. A wait action of the form $[a i]$ is a nonatomic action that may be treated as the repetition $*[\neg a i \rightarrow \text{skip}]$.)

A nonatomic action is initiated when its first atomic action is executed. A nonatomic action is terminated when its last atomic action is executed.

For nonatomic actions, the notion of completion does not coincide with that of termination. A nonatomic action might be considered completed even if it has not terminated, i.e., even if some atomic actions that are part of the action have not been executed. The definition of suspension is derived from that of completion.

Definition A nonatomic action $X$ is completed when it is initiated and is guaranteed to terminate, i.e., when all possible continuations of the computation contain the complete sequence of atomic actions of $X$.

The preceding definition can be further explained as follows: Consider a prefix $t 1$ of an arbitrary trace of a computation. (A trace is a sequence of
atomic actions corresponding to a possible execution of the program.) The completion of $X$ is identified with the point in the computation where $t 1$ has been completed, if (1) $X$ is initiated in $t 1$, and (2) all possible sequences $t 2$, such that $t 1$ extended with $t 2$ is a valid trace of the computation, contain the remaining atomic actions of $X$. Hence the completions of two nonatomic actions coincide if their completion points coincide.
(Observe that there may be several points in a trace that can act as completion point, which makes it easier to align the two completion points of two overlapping sequences so as to implement the bullet operator.)

Definition Between initiation and completion, an action is suspended.

These definitions of completion and suspension are valid because they satisfy the three semantic properties of completion and suspension that are used in correctness arguments, namely:

1. $\{c X=x\} X\{c X=x+1\}$,
2. $\mathbf{q} X \Rightarrow \operatorname{pre}(X)$, where $\operatorname{pre}(X)$ is any precondition of $X$ in terms of the program variables and auxiliary program variables,
3. If $X$ is completed, eventually $X$ is terminated.

These definitions will be used to implement the bullet operator and the communication primitives as defined by axioms $A 1$ and $A 2$. Consider the interleaving of $U_{x}$ and $U_{y}$. At the first semicolon, i.e., after $x o \uparrow$, $U_{x}$ has been initiated, but it cannot be considered completed since the valid continuation that does not contain $U_{y}$ does not contain the rest of $U_{x}$. At the second semicolon, both $U_{x}$ and $U_{y}$ have been initiated, and thus all continuations contain the rest of the interleaving of $U_{x}$ and $U_{y}$. Hence, $U_{x}$ and $U_{y}$ are guaranteed to terminate when they are both initiated, i.e., they fulfill $A 1$ and $A 2$.

### 10.2 Four-Phase Handshaking

Unfortunately, when the communication implemented by $U_{x}$ and $U_{y}$ terminates, all handshaking variables are true. Hence, we cannot implement the next communication on channel $(X, Y)$ with $U_{x}$ and $U_{y}$. The complementary implementation, however, can be used for the next matching pair, that is:

$$
\begin{aligned}
& D_{x} \equiv x o \downarrow ;[\neg x i] \\
& D_{y} \equiv[\neg y i] ; y o \downarrow .
\end{aligned}
$$

The solution consisting in alternating $U_{x}$ and $D_{x}$ as an implementation of $X$, and $U_{y}$ and $D_{y}$ as an implementation of $Y$, is called two-phase handshaking,
or two-cycle signaling. Since it is in most cases impossible to determine syntactically which $X$ - or $Y$-actions follow each other in an execution, the general two-phase handshaking implementations require testing the current value of the variables as follows:

$$
\begin{aligned}
& x o:=\neg x o ;[x i=x o] \\
& {[y i \neq y o] ; y o:=\neg y o .}
\end{aligned}
$$
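The general two-phase protocol can be traced with a few lines of Python. This is a sketch under stated assumptions: the function `two_phase` is hypothetical, the two wires are modeled as immediate copies, and only one fixed interleaving of the active and passive sides is executed. The `assert` statements stand for the wait actions, which are guaranteed to pass in this interleaving.

```python
def two_phase(n):
    """Run n communications of the two-phase protocol; return the variable snapshots."""
    xo = xi = yo = yi = False
    snapshots = []
    for _ in range(n):
        xo = not xo          # active side: xo := ¬xo
        yi = xo              # wire (xo w yi)
        assert yi != yo      # passive wait [yi ≠ yo] passes
        yo = not yo          # passive side: yo := ¬yo
        xi = yo              # wire (yo w xi)
        assert xi == xo      # active wait [xi = xo] passes
        snapshots.append((xo, xi, yo, yi))
    return snapshots

print(two_phase(2))
```

After an odd number of communications all four variables are true, and after an even number they are all false, which is why the current values have to be tested rather than assumed.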

In general, we prefer to use a simpler solution, known as four-phase handshaking, or four-cycle signaling. In a four-phase handshaking protocol, $X$ actions are implemented as " $U_{x} ; D_{x}$ " and $Y$-actions as " $U_{y} ; D_{y}$ ". Observe that the $D$-parts in $X$ and $Y$ introduce an extra communication between the two processes whose only purpose is to reset all variables to false.

Both protocols have the property that for a matching pair $(X, Y)$ of actions, the implementation is not symmetrical in $X$ and $Y$. One action is called active and the other one passive. The four-phase implementation, with $X$ active and $Y$ passive, is

$$
\begin{align*}
& X \equiv x o \uparrow ;[x i] ; x o \downarrow ;[\neg x i]  \tag{12}\\
& Y \equiv[y i] ; y o \uparrow ;[\neg y i] ; y o \downarrow . \tag{13}
\end{align*}
$$

(Later, we will introduce an alternative form of active implementation, called lazy-active.) Although four-phase handshaking contains twice as many actions as two-phase handshaking, the actions involved are simpler and are more amenable to the algebraic manipulations we shall introduce later. When operator delays dominate the communication costs, which is the case for communication inside a chip, four-phase handshaking will, in general, lead to more efficient solutions. When transmission delays dominate the communication costs, which is the case for communication between chips, two-phase handshaking is preferred.

### 10.3 Probe

A simple implementation of the probe $\bar{X}$ is $x i$, with $X$ implemented as passive. (Given our definition of suspension, the proof that this implementation of the probe fulfills its definition is straightforward.)

A probed communication action $\bar{X} \rightarrow \ldots X$ is then implemented as

$$
x i \rightarrow \ldots x o \uparrow ;[\neg x i] ; x o \downarrow .
$$

### 10.4 Choice of Active versus Passive Implementation

When no action of a matching pair is probed, the choice of which action should be active and which passive is arbitrary, but a choice has to be made. The choice can be important for the composition of identical circuits. A simple rule is that, for a given channel $(X, Y)$, all actions on one port (called the active port) are active, and all actions on the other port (called the passive port) are passive. If $\bar{X}$ is used, all $X$-actions are passive - with the obvious restriction that $\bar{Y}$ cannot be used in the same program.

We shall see, however, that this criterion for choosing active and passive ports may conflict with another criterion related to the implementation of input and output commands.

### 10.5 Properties of the Handshaking Protocol

For a matching pair $(X, Y)$ of actions implemented as (12) and (13), and the wires $(x o \underline{w} y i)$ and $(y o \underline{w} x i)$, the concurrent execution of $X$ and $Y$ causes the sequence of assignments

$$
x o \uparrow ; y i \uparrow ; y o \uparrow ; x i \uparrow ; x o \downarrow ; y i \downarrow ; y o \downarrow ; x i \downarrow,
$$

called the handshaking protocol. The following properties of the handshaking protocol play an important role in the compilation method.

Property 1 For $x o$ and $x i$ used as in the active protocol of (12), $x i$ is an acknowledgment of $x o \uparrow$ and $\neg x i$ is an acknowledgment of $x o \downarrow$. For $y o$ and $y i$ used as in the passive protocol of (13), $\neg y i$ is an acknowledgment of $y o \uparrow$ and $y i$ is an acknowledgment of $y o \downarrow$.

Property 2 In (12) and (13), $D_{x}$ and $D_{y}$ are used only to reset all variables to false. Hence, provided that the cyclic order of the actions of (12) and (13) is maintained, the sequences $D_{x}$ and $D_{y}$ can be inserted at any place in the program of each of the processes without invalidating the semantics of the communication involved. This transformation, called reshuffling, may introduce a deadlock.

Property 3 The wait-actions of (12) and (13) are stable. Reshuffling maintains the stability.

Reshuffling, which is the source of significant optimizations, will be used extensively. It is therefore important to know when Property 2 can be applied without introducing deadlock.

There are two simple cases where the reshuffling of sequence "$U_{x} ; D_{x} ; S$" into sequence "$U_{x} ; S ; D_{x}$" does not introduce deadlock:

- $S$ contains no communication action, or
- $X$ is an internal channel introduced by process decomposition.
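
The handshaking protocol of this section can be replayed with a small simulation. The following Python sketch is an illustration only. It assumes that (12) and (13) have the usual four-phase form, $xo\uparrow ;[xi] ; xo\downarrow ;[\neg xi]$ for the active port and $[yi] ; yo\uparrow ;[\neg yi] ; yo\downarrow$ for the passive port; the names `ACTIVE`, `PASSIVE`, `WIRES`, and `handshake_trace` are ours and do not come from the paper.

```python
# Illustrative sketch (not from the paper): replay the four-phase handshake
# between an active port (xo, xi) and a passive port (yo, yi).
# Assumed forms of (12) and (13):
#   active : xo+ ; [xi] ; xo- ; [not xi]
#   passive: [yi] ; yo+ ; [not yi] ; yo-
ACTIVE = [("set", "xo", True), ("wait", "xi", True),
          ("set", "xo", False), ("wait", "xi", False)]
PASSIVE = [("wait", "yi", True), ("set", "yo", True),
           ("wait", "yi", False), ("set", "yo", False)]
WIRES = [("xo", "yi"), ("yo", "xi")]  # (source, destination)


def handshake_trace(cycles=1):
    """Return the order of transitions for the given number of handshakes."""
    state = {"xo": False, "xi": False, "yo": False, "yi": False}
    pcs = [0, 0]  # program counters of the two processes
    trace = []
    while len(trace) < 8 * cycles:
        for k, prog in enumerate((ACTIVE, PASSIVE)):
            kind, var, val = prog[pcs[k] % len(prog)]
            if kind == "set":
                state[var] = val
                trace.append(var + ("+" if val else "-"))
                pcs[k] += 1
            elif state[var] == val:  # a wait completes once its condition holds
                pcs[k] += 1
        for src, dst in WIRES:  # each wire copies its input to its output
            if state[dst] != state[src]:
                state[dst] = state[src]
                trace.append(dst + ("+" if state[dst] else "-"))
    return trace[: 8 * cycles]


print(" ; ".join(handshake_trace()))
```

Under these assumptions the printed order is `xo+ ; yi+ ; yo+ ; xi+ ; xo- ; yi- ; yo- ; xi-`, where `+` stands for an up-transition and `-` for a down-transition.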

## 11 Production-Rule Expansion

Production-rule expansion is the transformation from a handshaking expansion to a set of PRs. It is the most crucial and most difficult step of the compilation since it requires the enforcement of sequencing by semantic means. It consists of three steps:

1. State assignment,
2. Guard strengthening,
3. Symmetrization.

We shall explain the algorithms for production-rule expansion with an example: the implementation of the simple process $(L / R)$, where $R$ is an active channel. This process is one of the basic building blocks for implementing sequencing. The handshaking expansion gives

$$
\begin{equation*}
*[[li] ; ro\uparrow ; [ri] ; ro\downarrow ; [\neg ri] ; lo\uparrow ; [\neg li] ; lo\downarrow] . \tag{14}
\end{equation*}
$$

We now consider the handshaking expansion as the specification of the implementation: Any implementation of the program has to satisfy the ordering defined by (14). The next step is to construct a production-rule set that satisfies this ordering. We start with the production-rule set that is syntactically derived from (14):

$$
\begin{array}{rll}
li & \mapsto & ro\uparrow \\
ri & \mapsto & ro\downarrow \\
\neg ri & \mapsto & lo\uparrow \\
\neg li & \mapsto & lo\downarrow .
\end{array}
$$

(As a clue to the reader, PRs of a set are listed in program order.)
Since the program is deadlock-free, effective execution of the PRs in program order is always possible. Some other execution orders, however, may also be possible. The production-rule set satisfies the handshaking-expansion specification if, and only if, the only possible execution order is the program order. If execution orders other than the program order are possible for the
production-rule set, the guards of some rules are strengthened so as to eliminate these execution orders.

In our example, program order is not the only execution order for the syntactic production-rule set: Since $\neg ri$ holds initially, the third PR can be executed first. This is also true for the fourth PR, but the execution of the fourth rule in the initial state is vacuous. Because all handshaking variables of $R$ are back to false when $R$ is completed, we cannot find a guard for the transition $lo\uparrow$ that holds only as a precondition of $lo\uparrow$ in (14). Hence, we cannot distinguish the state following $R$ from the state preceding $R$, and thus the sequential execution condition introduced in Section 8 cannot be satisfied.
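
The observation about the third and fourth PRs can be checked mechanically. The Python sketch below is an illustration under our own encoding: a rule is a tuple (name, guard, variable, target value), and a rule is called effective when its guard holds and its firing would change the state. The names `RULES`, `effective`, and `initial` are ours.

```python
# Illustrative sketch: which syntactic PRs derived from (14) can fire
# effectively in the initial state, where all variables are false?
RULES = [
    ("ro+", lambda s: s["li"],     "ro", True),
    ("ro-", lambda s: s["ri"],     "ro", False),
    ("lo+", lambda s: not s["ri"], "lo", True),
    ("lo-", lambda s: not s["li"], "lo", False),
]


def effective(rules, state):
    """Names of rules whose guard holds and whose firing changes the state."""
    return [name for name, guard, var, val in rules
            if guard(state) and state[var] != val]


initial = {"li": False, "lo": False, "ri": False, "ro": False}
print(effective(RULES, initial))  # ['lo+']
```

The only effective rule in the initial state is the third one, although program order requires the first rule to fire first; the fourth rule is enabled but vacuous.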

This is a general problem, since it arises for each unshuffled communication action. In order to fulfill the sequential-execution condition, we have to guarantee that each state of the handshaking expansion is unique, i.e., that there exists a predicate in terms of variables of the program that holds only in this state. The task of transforming the handshaking expansion so as to make each state unique is called state assignment.

### 11.1 State Assignment with State Variables

The first technique to define uniquely the state in which the transition $l o \dagger$ is to take place consists in introducing a state variable, say $x$, initially false. Handshaking expansion (14) becomes

$$
\begin{equation*}
*[[li] ; ro\uparrow ; [ri] ; x\uparrow ; [x] ; ro\downarrow ; [\neg ri] ; lo\uparrow ; [\neg li] ; x\downarrow ; [\neg x] ; lo\downarrow] . \tag{15}
\end{equation*}
$$

Observe that (15) is semantically equivalent to (14) since the two sequences of actions that are added to (14), namely, $x\uparrow ; [x]$ and $x\downarrow ; [\neg x]$, are equivalent to a skip. (The newly introduced variable $x$ is used nowhere else.)

There are several places where the two assignments to the state variable can be introduced. In general, a good heuristic is to introduce those assignments at such places that the alternation between waits and assignments is maintained. There are other heuristics, however, that can play a role in the placement of the variables.

Once state variables have been introduced so as to distinguish any two states of the handshaking expansion, it is possible to strengthen the guards of the PRs to enforce program-order execution. The basic algorithm for guard strengthening can be found in [10]. We shall not describe it here. Applied to
(15), it gives

$$
\begin{align*}
\neg x \wedge li & \mapsto ro\uparrow  \tag{16}\\
ri & \mapsto x\uparrow  \tag{17}\\
x & \mapsto ro\downarrow  \tag{18}\\
x \wedge \neg ri & \mapsto lo\uparrow  \tag{19}\\
\neg li & \mapsto x\downarrow  \tag{20}\\
\neg x & \mapsto lo\downarrow . \tag{21}
\end{align*}
$$

It is easy to check that the acknowledgment property is fulfilled and that the only possible execution order for the preceding production-rule set is the program order defined by (15).
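
This claim can be checked by simulating the closed system. The Python sketch below is only an illustration: the environment rules are our assumption (an active four-phase process driving $L$ and a passive one answering $R$), and the helper `run` is hypothetical. It fires the unique effective rule at each step and fails if there is ever a choice or a deadlock.

```python
# Illustrative sketch: the strengthened PR set (16)-(21), closed with an
# assumed environment, admits only one execution order.
CIRCUIT = [
    ("ro+", lambda s: not s["x"] and s["li"], "ro", True),   # (16)
    ("x+",  lambda s: s["ri"],                "x",  True),   # (17)
    ("ro-", lambda s: s["x"],                 "ro", False),  # (18)
    ("lo+", lambda s: s["x"] and not s["ri"], "lo", True),   # (19)
    ("x-",  lambda s: not s["li"],            "x",  False),  # (20)
    ("lo-", lambda s: not s["x"],             "lo", False),  # (21)
]
# Assumed environment: L is driven by an active process, R by a passive one.
ENVIRONMENT = [
    ("li+", lambda s: not s["lo"], "li", True),
    ("li-", lambda s: s["lo"],     "li", False),
    ("ri+", lambda s: s["ro"],     "ri", True),
    ("ri-", lambda s: not s["ro"], "ri", False),
]


def run(rules, steps):
    """Fire the single effective rule `steps` times and return the trace."""
    state = {v: False for v in ("li", "lo", "ri", "ro", "x")}
    trace = []
    for _ in range(steps):
        enabled = [(name, var, val) for name, guard, var, val in rules
                   if guard(state) and state[var] != val]
        if len(enabled) != 1:
            raise RuntimeError(f"choice or deadlock after {trace}: {enabled}")
        name, var, val = enabled[0]
        state[var] = val
        trace.append(name)
    return trace


trace = run(CIRCUIT + ENVIRONMENT, 20)
print(trace[:10])
```

With this environment, the first ten transitions are `li+, ro+, ri+, x+, ro-, ri-, lo+, li-, x-, lo-`, which is the order of (15), and the next ten repeat the same cycle.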

## 12 Operator Reduction

The last step of the compilation, called operator reduction, groups together the PRs that assign the same variables. Those PRs are then identified with (and implemented as) an operator. The program is thus identified with a set of operators.

Since we have enforced the stability of each rule and noninterference between any two complementary rules, we can implement any set of PRs directly. (For reasons of efficiency, we must see to it that the guards do not contain too many variables in a conjunct, which would lead to too many transistors in series. Hence, the implementation of the set may also involve decomposing a PR into several PRs by introducing new internal variables.)

The direct implementation of the PR set (16) through (21) is straightforward:
(16) and (18) correspond to the asymmetric C-element $(\neg x, li) \underline{aC} ro$.
(19) and (21) correspond to the asymmetric C-element $(x, \neg ri) \underline{aC} lo$.
(17) and (20) correspond to the flip-flop $(ri, li) \underline{ff} x$.

If the preceding operators are implemented as dynamic, this implementation of process $(L / R)$ is the simplest possible. If static implementations of the operators are required, another implementation might be considered with fewer state-holding elements since, as we have explained in the first part, static state-holding operators are slightly more difficult to realize than combinational operators.

A last transformation, called symmetrization, may be performed on the PR set to minimize the number of state-holding operators. Since symmetrization also introduces inefficiencies of its own, however, it should not be applied blindly.

## 13 Symmetrization

Symmetrization is performed on the two guards of PRs $b1 \mapsto z\uparrow$ and $b2 \mapsto z\downarrow$, when one of the two guards, say, $b1$, is already in the form $x \wedge \neg b2$. If we replace guard $b2$ with $\neg x \vee b2$, then the two guards are complements of each other, i.e., the operator is combinational. Of course, weakening guard $b2$ is a dangerous transformation since it may introduce a new state where the guard holds. We have to check that this does not occur by checking the following invariant:

Given the new rule $\neg x \vee b2 \mapsto z\downarrow$, $\neg z$ must hold in any state where $\neg x \wedge \neg b2$ holds, i.e., we have to check the invariant truth of

$$
x \vee b2 \vee \neg z.
$$


### 13.1 Operator Reduction of the (L/R)-element

The symmetrization of PRs (16) and (18), and of (19) and (21) of the $(L / R)$-element, gives

$$
\begin{align*}
\neg x \wedge li & \mapsto ro\uparrow  \tag{16}\\
ri & \mapsto x\uparrow  \tag{17}\\
\neg li \vee x & \mapsto ro\downarrow  \tag{18}\\
x \wedge \neg ri & \mapsto lo\uparrow  \tag{19}\\
\neg li & \mapsto x\downarrow  \tag{20}\\
ri \vee \neg x & \mapsto lo\downarrow . \tag{21}
\end{align*}
$$

(16) and (18) correspond to the and-operator $(\neg x, li) \underline{\wedge} ro$.
(17) and (20) correspond to the flip-flop $(ri, li) \underline{ff} x$.
(19) and (21) correspond to the and-operator $(x, \neg ri) \underline{\wedge} lo$.
(17) and (20) can also be implemented as the C-element $(li, ri) \underline{C} x$.

The resulting circuit is shown in Figure 6. (The dot identifies the input that is activated first.) This implementation of $(L / R)$, either with a flip-flop or with a C-element, is called a Q-element. The Q-element implementing ( $L / R$ ) as before is described by the infix notation $(l i, l o) \underline{Q}(r i, r o)$.
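
The invariants required by symmetrization can be checked on the reachable states of the $(L / R)$-element. The following Python sketch is an illustration under assumptions of ours: a closed-system environment (an active four-phase process on $L$, a passive one on $R$) and hypothetical helper names. For (16) and (18), the up-guard is $li \wedge \neg b2$ with $b2 = x$, so the invariant to check is $li \vee x \vee \neg ro$. For (19) and (21), the up-guard is $\neg ri \wedge \neg b2$ with $b2 = \neg x$, so the invariant is $\neg ri \vee \neg x \vee \neg lo$.

```python
from collections import deque

# Illustrative sketch: check the symmetrization invariants on all states
# reachable with the unsymmetrized rules (16)-(21) and an assumed environment.
VARS = ("li", "lo", "ri", "ro", "x")
RULES = [
    ("ro+", lambda s: not s["x"] and s["li"], "ro", True),   # (16)
    ("x+",  lambda s: s["ri"],                "x",  True),   # (17)
    ("ro-", lambda s: s["x"],                 "ro", False),  # (18)
    ("lo+", lambda s: s["x"] and not s["ri"], "lo", True),   # (19)
    ("x-",  lambda s: not s["li"],            "x",  False),  # (20)
    ("lo-", lambda s: not s["x"],             "lo", False),  # (21)
    # Assumed environment: active on L, passive on R.
    ("li+", lambda s: not s["lo"], "li", True),
    ("li-", lambda s: s["lo"],     "li", False),
    ("ri+", lambda s: s["ro"],     "ri", True),
    ("ri-", lambda s: not s["ro"], "ri", False),
]


def reachable(rules):
    """Breadth-first enumeration of the states reachable from all-false."""
    init = {v: False for v in VARS}
    seen = {tuple(init.values())}
    states, queue = [init], deque([init])
    while queue:
        s = queue.popleft()
        for _name, guard, var, val in rules:
            if guard(s) and s[var] != val:
                nxt = dict(s)
                nxt[var] = val
                key = tuple(nxt.values())
                if key not in seen:
                    seen.add(key)
                    states.append(nxt)
                    queue.append(nxt)
    return states


states = reachable(RULES)
inv_ro = all(s["li"] or s["x"] or not s["ro"] for s in states)
inv_lo = all(not s["ri"] or not s["x"] or not s["lo"] for s in states)
print(len(states), inv_ro, inv_lo)
```

Under this environment model the exploration visits the ten states of one handshake cycle, and both invariants hold in each of them, which is what allows the two guards to be weakened.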

## 14 Isochronic Forks

In the previous operator reduction, $li$ is an input to the flip-flop $(li, ri) \underline{ff} x$ and to the and-operator $(li, \neg x) \underline{\wedge} ro$. Formally, in order to compose the PRs together to form a circuit, we have to introduce the fork $li \underline{f}(l1, l2)$ and replace $li$ by $l1$ as input of the and-operator, and by $l2$ as input of the flip-flop. We also have to introduce the forks $ri \underline{f}(r1, r2)$ and $x \underline{f}(x1, x2)$ for the same reason.

Let us analyze the effect of the first fork only. The PR set that includes the PRs of the fork is

$$
\begin{align*}
li & \mapsto l1\uparrow, l2\uparrow  \tag{16a}\\
\neg x \wedge l1 & \mapsto ro\uparrow  \tag{16b}\\
ri & \mapsto x\uparrow  \tag{17}\\
\neg l1 \vee x & \mapsto ro\downarrow  \tag{18}\\
x \wedge \neg ri & \mapsto lo\uparrow  \tag{19}\\
\neg li & \mapsto l1\downarrow, l2\downarrow  \tag{20a}\\
\neg l2 & \mapsto x\downarrow  \tag{20b}\\
ri \vee \neg x & \mapsto lo\downarrow . \tag{21}
\end{align*}
$$

Now we observe that transition $l1\uparrow$ of (16a) is acknowledged by the guard of (16b) but $l2\uparrow$ is not, and transition $l2\downarrow$ of (20a) is acknowledged by the guard of (20b) but $l1\downarrow$ is not. Hence, the assignments $l2\uparrow$ and $l1\downarrow$ do not fulfill the completion requirement and thus are not stable!
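
The instability can be exhibited by exploring the states of the PR set with the fork when no assumption is made on the relative delays of the two branches. The Python sketch below is an illustration only: the environment rules and helper names are our assumptions. It searches for reachable states in which (20b) can fire $x\downarrow$ while $li$ is still true, that is, before the input has been lowered.

```python
from collections import deque

# Illustrative sketch: with an arbitrary-delay fork li -> (l1, l2), the rule
# (20b) can fire x- prematurely because l2+ is never acknowledged.
VARS = ("li", "l1", "l2", "lo", "ri", "ro", "x")
RULES = [
    ("l1+", lambda s: s["li"],                "l1", True),   # (16a)
    ("l2+", lambda s: s["li"],                "l2", True),   # (16a)
    ("ro+", lambda s: not s["x"] and s["l1"], "ro", True),   # (16b)
    ("x+",  lambda s: s["ri"],                "x",  True),   # (17)
    ("ro-", lambda s: not s["l1"] or s["x"],  "ro", False),  # (18)
    ("lo+", lambda s: s["x"] and not s["ri"], "lo", True),   # (19)
    ("l1-", lambda s: not s["li"],            "l1", False),  # (20a)
    ("l2-", lambda s: not s["li"],            "l2", False),  # (20a)
    ("x-",  lambda s: not s["l2"],            "x",  False),  # (20b)
    ("lo-", lambda s: s["ri"] or not s["x"],  "lo", False),  # (21)
    # Assumed environment: active on L, passive on R.
    ("li+", lambda s: not s["lo"], "li", True),
    ("li-", lambda s: s["lo"],     "li", False),
    ("ri+", lambda s: s["ro"],     "ri", True),
    ("ri-", lambda s: not s["ro"], "ri", False),
]


def reachable(rules):
    """Breadth-first enumeration of the states reachable from all-false."""
    init = {v: False for v in VARS}
    seen = {tuple(init.values())}
    states, queue = [init], deque([init])
    while queue:
        s = queue.popleft()
        for _name, guard, var, val in rules:
            if guard(s) and s[var] != val:
                nxt = dict(s)
                nxt[var] = val
                key = tuple(nxt.values())
                if key not in seen:
                    seen.add(key)
                    states.append(nxt)
                    queue.append(nxt)
    return states


states = reachable(RULES)
# x- is effective (x true, l2 still false) although li has not been lowered.
hazards = [s for s in states if s["x"] and s["li"] and not s["l2"]]
print(len(hazards) > 0)  # True
```

One such state is reached by the sequence `li+, l1+, ro+, ri+, x+` with `l2` still low: the slow branch of the fork has not yet changed, so the guard $\neg l2$ of (20b) still holds.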

We solve this problem by making a simplifying assumption: We assume that the fork is isochronic. That is, the difference in delays between the two branches of the fork is shorter than the delays in the operators to which the fork is an input. Hence, when a transition on one output is acknowledged and thus completed, the transition on the other output is also acknowledged and thus completed.

Figure 6. Implementation of $(L / R)$ with a $Q$-element.

This is the only timing condition that must be fulfilled. In general, the constraint is easy to meet because it is one-sided. The isochronicity requirement is more difficult to meet, however, when a negated input introduces an inverter on a branch of the fork, since the transition delays of an inverter are of the same order of magnitude as the transition delays of other operators. We have proved that, for the implementation of each language construct, these inverters can always be eliminated from the isochronic forks by simple transformations. ${ }^{3}$ (See [1, 2].)

In [11], we have proved that the class of entirely delay-insensitive circuits is very limited: Practically all circuits of interest fall outside the class. We believe that the notion of isochronic fork is the weakest compromise to delay-insensitivity sufficient to implement any circuit of interest.

Which forks have to be isochronic is easy to decide by a simple analysis of the PR sets. For instance, the fork $ri \underline{f}(r1, r2)$ also has to be isochronic, but the fork $x \underline{f}(x1, x2)$ does not. We shall ignore the issue of isochronic forks in the rest of this presentation.

## 15 Reshuffled Implementations of (L/R)

We illustrate the use of reshuffling by deriving two other implementations of $(L / R)$. If $L$ is an internal channel introduced for process decomposition, we can reshuffle the handshaking expansions of $L$ and $R$ without the risk of introducing deadlock. Let us return to handshaking expansion (14).

### 15.1 First Reshuffling

We postpone the second half of the handshaking expansion of $R$ (i.e., the sequence $ro\downarrow ; [\neg ri]$) until after $[\neg li]$. We get

$$
*[[li] ; ro\uparrow ; [ri] ; lo\uparrow ; [\neg li] ; ro\downarrow ; [\neg ri] ; lo\downarrow].
$$

The syntactic PR expansion we now derive is already "program-ordered":

$$
\begin{array}{rll}
li & \mapsto & ro\uparrow \\
ri & \mapsto & lo\uparrow \\
\neg li & \mapsto & ro\downarrow \\
\neg ri & \mapsto & lo\downarrow .
\end{array}
$$

The first and third rules specify the wire $(li \underline{w} ro)$; the second and fourth rules specify the wire $(ri \underline{w} lo)$. Hence, the implementation reduces to two wires!

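### 15.2 Second Reshuffling: The D-element

We now postpone the whole handshaking expansion of $R$ until after $[\neg li]$. We get

$$
*[[li] ; lo\uparrow ; [\neg li] ; ro\uparrow ; [ri] ; ro\downarrow ; [\neg ri] ; lo\downarrow].
$$

We need to introduce a state variable, say $x$, as follows:

$$
*[[li] ; x\uparrow ; [x] ; lo\uparrow ; [\neg li] ; ro\uparrow ; [ri] ; x\downarrow ; [\neg x] ; ro\downarrow ; [\neg ri] ; lo\downarrow].
$$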

The PR expansion gives

$$
\begin{array}{rll}
li & \mapsto & x\uparrow \\
(ri \vee)\ x & \mapsto & lo\uparrow \\
x \wedge \neg li & \mapsto & ro\uparrow \\
ri & \mapsto & x\downarrow \\
(li \vee)\ \neg x & \mapsto & ro\downarrow \\
\neg x \wedge \neg ri & \mapsto & lo\downarrow .
\end{array}
$$

The terms between parentheses have been added for symmetrization. The operator reduction gives

$$
\begin{array}{rll}
(li, \neg ri) & \underline{ff} & x \\
(ri, x) & \underline{\vee} & lo \\
(x, \neg li) & \underline{\wedge} & ro .
\end{array}
$$

The flip-flop can be replaced with the C-element $(li, \neg ri) \underline{C} x$. The circuit, shown in Figure 7, is called a D-element.
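
The operator reduction of the D-element can be cross-checked by testing, for each output, whether the up-guard and the down-guard of the symmetrized PR set are complements of each other. The Python sketch below is an illustration with names of our own (`GUARDS`, `is_combinational`); it enumerates all assignments of the guard variables.

```python
from itertools import product

# Illustrative sketch: classify the operators of the D-element PR set.
# An operator is combinational when its two guards are complements.
VARS = ("li", "ri", "x")
GUARDS = {  # output: (up guard, down guard), symmetrized terms included
    "x":  (lambda s: s["li"],                lambda s: s["ri"]),
    "lo": (lambda s: s["ri"] or s["x"],      lambda s: not s["x"] and not s["ri"]),
    "ro": (lambda s: s["x"] and not s["li"], lambda s: s["li"] or not s["x"]),
}


def is_combinational(up, down):
    """True when exactly one of the two guards holds in every assignment."""
    return all(up(s) != down(s)
               for values in product([False, True], repeat=len(VARS))
               for s in [dict(zip(VARS, values))])


kinds = {out: is_combinational(up, down) for out, (up, down) in GUARDS.items()}
print(kinds)  # {'x': False, 'lo': True, 'ro': True}
```

The outputs $lo$ and $ro$ come out as combinational (the or-operator and the and-operator), while $x$ needs a state-holding element, in agreement with the flip-flop or C-element of the reduction.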

## 16 Sequencing

There are many ways to implement the sequencing of $n$ arbitrary actions. We shall introduce the basic operators that are used in the most straightforward implementations.

### 16.1 The Active-Active Buffer

Consider the program $*\left[S_{1} ; S_{2}\right]$, where $S_{1}$ and $S_{2}$ are two arbitrary program parts. Process decomposition of this program gives

$$
*[L ; R]\left\|\left(L^{\prime} / S_{1}\right)\right\|\left(R^{\prime} / S_{2}\right)
$$

Hence the basic sequencing operator is the process

$$
B\left(L_{a}, R_{a}\right) \equiv *[L ; R],
$$

where both $L$ and $R$ are active. This process is called an active-active buffer. The handshaking expansion gives

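$$
\begin{equation*}
*[lo\uparrow ; [li] ; lo\downarrow ; [\neg li] ; ro\uparrow ; [ri] ; ro\downarrow ; [\neg ri]] . \tag{22}
\end{equation*}
$$

Since $ri$ is false initially, we can rewrite (22) as

$$
\begin{equation*}
*[[\neg ri] ; lo\uparrow ; [li] ; lo\downarrow ; [\neg li] ; ro\uparrow ; [ri] ; ro\downarrow] . \tag{23}
\end{equation*}
$$

By comparing (23) with (14), the handshaking expansion of the Q-element, we observe that $B\left(L_{a}, R_{a}\right) \equiv(\neg ri, ro) \underline{Q}(li, lo)$, which gives the implementation of Figure 8.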

### 16.2 The (L/A;R)-element

Figure 7. The $D$-element.

In order to generalize the preceding construction to the case of an arbitrary number of actions, we must implement the generalization of the $(L / R)$-element. Sequence

$$
\begin{equation*}
*\left[S_{1} ; S_{2} ; \ldots ; S_{n}\right] \tag{24}
\end{equation*}
$$

can be decomposed into a number of shorter sequences by repeatedly applying process decomposition. There are as many ways to decompose (24) as there are binary trees of $n$ leaves. But observe that, if $n>2$, all decompositions will require at least one process of the form

$$
(L / A ; R),
$$

where $A$ and $R$ are active communication actions. (The semicolon binds more tightly than the process call.) We shall use two different reshufflings to implement this process. Again, these reshufflings maintain the semantics of the original program if the handshaking expansion of $L^{\prime}$ is not reshuffled. The first reshuffling is

$$
*[[li] ; ao\uparrow ; [ai] ; lo\uparrow ; [\neg li] ; ao\downarrow ; [\neg ai] ; R ; lo\downarrow].
$$

We decompose it into two sequences by applying a process-factorization decomposition described in [10]:

$$
\begin{aligned}
(\, & *[[li] ; ao\uparrow ; [\neg li] ; ao\downarrow] \\
\| \, & *[[ai] ; lo\uparrow ; [\neg ai] ; R ; lo\downarrow] \\
). &
\end{aligned}
$$

The first sequence is the wire $(li \underline{w} ao)$. The second sequence is the D-element $(ai, lo) \underline{D}(ri, ro)$.

Figure 8. Implementation of the active-active buffer with a $Q$-element.


The second reshuffling is

$$
*[[li] ; A ; ro\uparrow ; [ri] ; lo\uparrow ; [\neg li] ; ro\downarrow ; [\neg ri] ; lo\downarrow].
$$

Again, we decompose it into two sequences by process factorization:

$$
\begin{aligned}
(\, & *[[ri] ; lo\uparrow ; [\neg ri] ; lo\downarrow] \\
\| \, & *[[li] ; A ; ro\uparrow ; [\neg li] ; ro\downarrow] \\
). &
\end{aligned}
$$

The first sequence is the wire $(ri \underline{w} lo)$. The second sequence is the Q-element $(li, ro) \underline{Q}(ai, ao)$. Both implementations are shown in Figure 9.

Now the implementation of a sequence of $n$ actions is straightforward. For instance, for $n=4$, we have two "linear" decompositions of ( $L / S_{1} ; S_{2} ; S_{3} ; S_{4}$ ). The first one is

$$
\left(\left(L / S_{1} ; L_{1}\right)\left\|\left(L_{1} / S_{2} ; L_{2}\right)\right\|\left(L_{2} / S_{3} ; S_{4}\right)\right)
$$

The second one is

$$
\left(\left(L / L_{2} ; S_{4}\right)\left\|\left(L_{2} / L_{1} ; S_{3}\right)\right\|\left(L_{1} / S_{1} ; S_{2}\right)\right)
$$

These two decompositions lead to the linear implementations shown in Figure 10.

### 16.3 The Passive-Active Buffer

Figure 9. Implementations of the $(L / A ; R)$-element.

In order to compose one-place buffers in a linear chain, one channel must be active and the other one passive. We implement the buffer with $L$ passive and $R$ active. This version is denoted by $B\left(L_{p}, R_{a}\right)$. In order to take advantage of the active-active case, we decompose the buffer into two processes $q$ and $t$:

$$
\begin{aligned}
q & \equiv *\left[D^{\prime} ; R\right] \\
t & \equiv(D / L)
\end{aligned}
$$

Process $q$ is an active-active buffer. The compilation of $t$ is straightforward. The handshaking expansion gives

$$
*[[di] ; [li] ; lo\uparrow ; [\neg li] ; lo\downarrow ; do\uparrow ; [\neg di] ; do\downarrow].
$$

Since $D$ is an internal channel, we can reshuffle the sequence $[\neg li] ; lo\downarrow$ with respect to $D$ without introducing deadlock. (Also observe that since $do\downarrow$ remains the last action of the sequence, we have not changed the order of $L$ relative to $R$.) We get

$$
*[[di] ; [li] ; lo\uparrow ; do\uparrow ; [\neg di] ; [\neg li] ; lo\downarrow ; do\downarrow].
$$

The PR expansion leading to the circuit of Figure 6 is

$$
\begin{aligned}
di \wedge li & \mapsto lo\uparrow, do\uparrow \\
\neg di \wedge \neg li & \mapsto lo\downarrow, do\downarrow .
\end{aligned}
$$

Process $t$ is used to connect the two ports of a channel when they are both active. It is called a "passive-passive adaptor". The complete circuit is shown in Figure 11.

The passive-active buffer can be compiled directly by introducing a state variable. The circuit obtained is slightly different. See [8].

Figure 10. Implementations of $\left(L / S_{1} ; S_{2} ; S_{3} ; S_{4}\right)$.


## 17 Single-Variable Register

Consider the following register process, which provides read and write access to a simple boolean variable, $x$ :

$$
\begin{equation*}
\begin{aligned}
& *[[\bar{P} \rightarrow P ? x \\
& \;\;\square\; \bar{Q} \rightarrow Q ! x \\
& ]],
\end{aligned} \tag{25}
\end{equation*}
$$

where $\neg \bar{P} \vee \neg \bar{Q}$ holds at any time.
The handshaking expansion of (25) uses the double-rail technique: The boolean value of $x$ is encoded on two wires, one for the value true and one for the value false. Input channel $P$ has two input wires, $pi1$ for receiving the value true and $pi2$ for receiving the value false, and one output wire, $po$. Output channel $Q$ has two output wires, $qo1$ for sending the value true and $qo2$ for sending the value false, and one input wire, $qi$. Each guarded command of (25) is expanded to two guarded commands:

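$$
\begin{equation*}
\begin{aligned}
& *[[pi1 \rightarrow x\uparrow ; [x] ; po\uparrow ; [\neg pi1] ; po\downarrow \\
& \;\;\square\; pi2 \rightarrow x\downarrow ; [\neg x] ; po\uparrow ; [\neg pi2] ; po\downarrow \\
& \;\;\square\; x \wedge qi \rightarrow qo1\uparrow ; [\neg qi] ; qo1\downarrow \\
& \;\;\square\; \neg x \wedge qi \rightarrow qo2\uparrow ; [\neg qi] ; qo2\downarrow \\
& ]].
\end{aligned} \tag{26}
\end{equation*}
$$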

Figure 11. An implementation of the passive-active buffer.


### 17.1 Mutual Exclusion between Guarded Commands

We are now faced with a new problem: enforcing mutual exclusion between the production-rule sets of different guarded commands. (This problem is not concerned with making the guards of the different commands mutually exclusive. For the time being, we are considering only examples where the guards of the commands are already mutually exclusive.) Let us illustrate our problem with the compilation of the first two guarded commands. If we just concatenate the production-rule sets of these two commands, we get

$$
\begin{array}{rll}
pi1 & \mapsto & x\uparrow \\
pi1 \wedge x & \mapsto & po\uparrow \\
\neg pi1 & \mapsto & po\downarrow \\
pi2 & \mapsto & x\downarrow \\
pi2 \wedge \neg x & \mapsto & po\uparrow \\
\neg pi2 & \mapsto & po\downarrow .
\end{array}
$$

We now observe, however, that the second and the sixth PRs are interfering (they set and reset the same variable $po$), and that, for reasons of symmetry, the same holds for the third and the fifth PRs.

Hence, the problem of ensuring mutual exclusion between PRs of different guarded commands is the same as enforcing program order between PRs of the same guarded command. We use the same technique, which consists in strengthening the guards of the production rules, if necessary, by introducing state variables to distinguish between the states corresponding to each true guard.

In the case at hand, we strengthen the guards of the third and the sixth rules as

$$
\begin{aligned}
x \wedge \neg pi1 & \mapsto po\downarrow \\
\neg x \wedge \neg pi2 & \mapsto po\downarrow .
\end{aligned}
$$

The rest of the implementation is straightforward. The first and fourth PRs correspond to the flip-flop $(pi1, \neg pi2) \underline{ff} x$. The other PRs can be transformed into

$$
\begin{array}{rll}
(pi1 \wedge x) \vee(pi2 \wedge \neg x) & \mapsto & po\uparrow \\
(\neg pi1 \wedge x) \vee(\neg pi2 \wedge \neg x) & \mapsto & po\downarrow,
\end{array}
$$

which is the definition of the if-operator $(pi1, pi2, x) \underline{if} po$.
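
The effect of the guard strengthening on interference can be checked by brute force. The Python sketch below is an illustration with names of our own: it enumerates all values of $(pi1, pi2, x)$ in which $pi1$ and $pi2$ are not both true (the guards of the two commands are assumed mutually exclusive) and reports the states in which a rule setting $po$ and a rule resetting $po$ are enabled together.

```python
from itertools import product

# Illustrative sketch: states where an up-rule and a down-rule on po are
# both enabled, before and after strengthening the third and sixth PRs.


def interfering(up, down):
    """Return the (pi1, pi2, x) triples in which both guards hold."""
    return [(pi1, pi2, x)
            for pi1, pi2, x in product([False, True], repeat=3)
            if not (pi1 and pi2) and up(pi1, pi2, x) and down(pi1, pi2, x)]


def up(pi1, pi2, x):  # second and fifth PRs
    return (pi1 and x) or (pi2 and not x)


def down_naive(pi1, pi2, x):  # third and sixth PRs before strengthening
    return (not pi1) or (not pi2)


def down_strong(pi1, pi2, x):  # third and sixth PRs after strengthening
    return (x and not pi1) or (not x and not pi2)


print(interfering(up, down_naive))
# [(False, True, False), (True, False, True)]
print(interfering(up, down_strong))
# []
```

The two interfering states of the concatenated set are exactly those in which one command is raising $po$ while the down-rule of the other command is enabled; after strengthening, no such state remains.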

The production-rule expansion of the last two guarded commands of (26) gives

$$
\begin{aligned}
x \wedge qi & \mapsto qo1\uparrow \\
\neg x \vee \neg qi & \mapsto qo1\downarrow \\
\neg x \wedge qi & \mapsto qo2\uparrow \\
x \vee \neg qi & \mapsto qo2\downarrow
\end{aligned}
$$

which corresponds to the two operators $(x, qi) \underline{\wedge} qo1$ and $(\neg x, qi) \underline{\wedge} qo2$. The circuit is represented in Figure 12.

In the next example, we shall refer to the implementation of the first two guarded commands of (26) as the register operator:

$$
(p i 1, p i 2) \operatorname{reg}(p o, x) .
$$

We shall refer to the implementation of the last two guarded commands of (26) as the read operator:

$$
(q i, x) \text { read }(q o 1, q o 2)
$$

## 18 Implementation of the Stack

The implementation of the stack will be used to explain the general method for implementing communications that involve passing messages. The method relies on the time-honored "divide-and-conquer" principle: We first construct the so-called control part of the program, which is the original program where the messages have been removed from each communication action. We then combine this control part with a data path, which is a program implementing the assignment parts of the communication actions. (See Figure 16 in Section 20.) The basic technique for combining control and data was introduced in [9].

Figure 12. Single boolean register.

### 18.1 The Control Part of the Stack

The control part of the stack consists of programs $E$ and $F$, from which message communication has been removed. We assume that the stack is empty initially. We introduce the channel $\left(t, t^{\prime}\right)$ so that $F$ can be called from within $E$ by process decomposition. We get

$$
\begin{array}{rl}
E \equiv & *[[\overline{in} \rightarrow in ; t \\
& \;\;\square\; \overline{out} \rightarrow get ; out \\
& ]] \\
F \equiv & *[[\overline{t^{\prime}} \wedge \overline{in} \rightarrow put ; in \\
& \;\;\square\; \overline{t^{\prime}} \wedge \overline{out} \rightarrow out ; t^{\prime} \\
& ]] .
\end{array}
$$

In the handshaking expansion, we let the choice of active and passive communications be dictated by the occurrence of the probes. (We will, however, return to this choice later.) We get

 ]]
$F \equiv *[[ti^{\prime} \wedge ini \rightarrow puto\uparrow ; [puti] ; puto\downarrow ; [\neg puti] ; ino\uparrow ; [\neg ini] ; ino\downarrow$

]] .
Observe that, after handshaking expansion, the symmetry between $E$ and $F$ has been restored. The choice of whether $t i$ or $t i^{\prime}$ should be negated in the guards determines whether $E$ or $F$ should be called initially, i.e., whether we start with an empty or a full stack element.

### 18.2 Compilation of E

The first guarded command, $E 1$, is a standard passive-active buffer. The second guarded command, E2, is a standard Q-element. The implementation of $E$ must combine the implementations of $E 1$ and $E 2$ in a way that enforces mutual exclusion between the execution of $E 1$ and that of $E 2$.

Since the execution of in and that of out are mutually exclusive, it suffices to guarantee that when in is completed in E1, E2 cannot start until $t$ is completed. We introduce the variable $z$ (initially true) in the handshaking expansion of $E 1$, as indicated in Figure 13, and we strengthen the guard of $E 2$ with $z$. We get



Now $E2$ cannot start until $z\uparrow$ is completed, i.e., until $E1$ is completed. Since, by the structure of $E1$, $z \Rightarrow \neg ti$, we can simplify the guard of $E2$ to $outi \wedge z$. For symmetrization, we also weaken $\neg outi$ as $\neg outi \vee \neg z$. Hence, mutual exclusion is enforced by replacing input $outi$ with the and-operator $(outi, z) \underline{\wedge} outi^{\prime}$ in the $Q$-element implementation of $E2$. This gives the circuit of Figure 14 as an implementation of $E$.

Figure 13. Implementation of the first g.c. of $E$ with variable $z$.


### 18.3 Compilation of $F$

The compilation of $F 1$ is identical to that of $E 2$ with the appropriate change of variables. The compilation of $F 2$, however, can be simplified by reshuffling. Since channel ( $t, t^{\prime}$ ) is internal, we can reshuffle the handshaking sequence of $t^{\prime}$ without deadlock. The handshaking expansion of $F 2$ becomes

$$
ti^{\prime} \wedge outi \rightarrow outo\uparrow ; to^{\prime}\uparrow ; [\neg ti^{\prime} \wedge \neg outi] ; outo\downarrow ; to^{\prime}\downarrow,
$$

which compiles immediately into the "forked" C-element $(ti^{\prime}, outi) \underline{C}(outo, to^{\prime})$. The reshuffling guarantees that $F1$ cannot be started before $F2$ is completed.

The channels in and out are used in both $E$ and $F$, so we must merge the local copies of in and the local copies of out in a standard way that we do not describe here. The resulting circuit for the control part of the stack element is shown in Figure 15.

Figure 14. Implementation of $E$.


## 19 Implementation of the Data Path

We now have to extend the implementation of the control part $S 2$ so as to obtain an implementation of the whole program $S 1$. We want to leave $S 2$ unchanged by introducing a datapath process, $P$, such that the parallel composition of $S 2$ and $P$ implements $S 1$.

The channels in, out, get, put of $S2$ are renamed $in^{\prime}$, $out^{\prime}$, $get^{\prime}$, $put^{\prime}$. $P$ communicates with $S2$ via $in^{\prime}$, $out^{\prime}$, $get^{\prime}$, $put^{\prime}$ and with the environment via in, out, get, put. (See Figure 16.)

Figure 15. The control part of the stack element.

Let $C$ be a channel of $S 1$, and $C^{\prime}$ be the renamed channel of $S 2$ to which $C$ corresponds. For ( $S 2 \| P$ ) to implement $S 1$, each communication on $C$ must coincide with a communication on $C^{\prime}$; i.e., $P$ must implement the so-called channel interface process

$$
I_{C} \equiv *\left[C \cdot C^{\prime}\right]
$$

Hence, $P$ has to implement the four channel interfaces:

$$
\begin{aligned}
& *[in^{\prime} \bullet in ? x] \\
& *[out^{\prime} \bullet out ! x] \\
& *[get^{\prime} \bullet get ? x] \\
& *[put^{\prime} \bullet put ! x] .
\end{aligned}
$$


## 20 Implementation of Channel Interfaces

There are four types of channel interfaces, depending on whether the port is active or passive, and whether the communication is an input or an output.

Figure 16. Adding the data path.


### 20.1 Input Actions on a Passive Port

We want to implement the interface $I_{C}$ for action $C ? x$ on the passive port $C$. $I_{C}$ communicates with $S2$ by the active port $C^{\prime}$, and with the environment by the passive port $D$. Furthermore, in the standard double-rail encoding technique, the two-wire implementation $(ci, co)$ of $C$ has to be interfaced to the three-wire input port $D$ in which the two input wires, $di1$ and $di2$, are used to encode the two values of the incoming message. (See Figure 17.)

$I_{C}$ has to implement an interleaving of the following three sequences:

$$
\begin{aligned}
S_{C} & \equiv *[ci^{\prime}\uparrow ; [co^{\prime}] ; ci^{\prime}\downarrow ; [\neg co^{\prime}]] \\
S_{D} & \equiv *[[di1 \vee di2] ; do\uparrow ; [\neg di1 \wedge \neg di2] ; do\downarrow] \\
S_{x} & \equiv *[[di1 \rightarrow x\uparrow ; [x] \;\square\; di2 \rightarrow x\downarrow ; [\neg x]]]
\end{aligned}
$$

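An implementation of $C^{\prime} \bullet D$ interleaves sequences $S_{C}$ and $S_{D}$ as

$$
\begin{equation*}
*[[di1 \vee di2] ; ci^{\prime}\uparrow ; [co^{\prime}] ; do\uparrow ; [\neg di1 \wedge \neg di2] ; ci^{\prime}\downarrow ; [\neg co^{\prime}] ; do\downarrow] . \tag{28}
\end{equation*}
$$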

In the interleaving of (28) and $S_{x}$, the assignment to $x$ is inserted after [ $c o^{\prime}$ ] so as to ensure that communication action $C$ has been started when the assignment to $x$ is performed:

$$
\begin{gather*}
*[[di1 \vee di2] ; ci^{\prime}\uparrow ; [co^{\prime} \wedge di1 \rightarrow x\uparrow ; [x] \;\square\; co^{\prime} \wedge di2 \rightarrow x\downarrow ; [\neg x]] ;  \tag{29}\\
do\uparrow ; [\neg di1 \wedge \neg di2] ; ci^{\prime}\downarrow ; [\neg co^{\prime}] ; do\downarrow]
\end{gather*}
$$

Figure 17. Channel interface for input port.


Next, we factor (29) as

$$
\begin{equation*}
*[[di1 \vee di2] ; ci^{\prime}\uparrow ; [\neg di1 \wedge \neg di2] ; ci^{\prime}\downarrow] \tag{30}
\end{equation*}
$$

and

$$
\begin{equation*}
\begin{aligned}
& *[[co^{\prime} \wedge di1 \rightarrow x\uparrow ; [x] ; do\uparrow ; [\neg co^{\prime}] ; do\downarrow \\
& \;\;\square\; co^{\prime} \wedge di2 \rightarrow x\downarrow ; [\neg x] ; do\uparrow ; [\neg co^{\prime}] ; do\downarrow \\
& ]].
\end{aligned} \tag{31}
\end{equation*}
$$

Sequence (30) is realized by the operator $(di1, di2) \underline{\vee} ci^{\prime}$. We factor (31) so as to isolate the register part:

$$
\begin{aligned}
(co^{\prime}, di1) \underline{aC} x1 & \equiv *[[co^{\prime} \wedge di1] ; x1\uparrow ; [\neg co^{\prime}] ; x1\downarrow] \\
(co^{\prime}, di2) \underline{aC} x2 & \equiv *[[co^{\prime} \wedge di2] ; x2\uparrow ; [\neg co^{\prime}] ; x2\downarrow] \\
(x1, x2) \underline{reg}(x, do) & \equiv *[[x1 \rightarrow x\uparrow ; [x] ; do\uparrow ; [\neg x1] ; do\downarrow \\
& \qquad \square\; x2 \rightarrow x\downarrow ; [\neg x] ; do\uparrow ; [\neg x2] ; do\downarrow \\
& \qquad ]].
\end{aligned}
$$

The implementation is shown in Figure 18.

### 20.2 Input Actions on an Active Port

For port $C$ active, the communication variables of the interface $I_{C}$ remain the same. But now the handshaking expansions of $C^{\prime}$ and $D$ are different, since $C^{\prime}$ is passive and $D$ is active. We get
$S_{C} \equiv *\left[\left[c o^{\prime}\right] ; c i^{\prime} \uparrow ;\left[7 c o^{\prime}\right] ; c i^{\prime} l\right]$
$S_{D} \equiv *[d o \dagger ;[d i 1 \vee d i 2] ;$ dol; $[\neg d i 1 \wedge \neg d i 2]]$
$S_{x} \equiv *[[d i 1 \rightarrow x \uparrow ;[x] \| d i 2 \rightarrow x \downarrow ;[\neg x]]]$.
(Observe that $S_{X}$ is not changed.) An interleaving of $S_{C}$ and $S_{D}$ that implements $C^{\prime} \bullet D$ is the interleaving corresponding to two wires:

$$
*\left[\left[co^{\prime}\right] ; do \uparrow ;[di1 \vee di2] ; ci^{\prime} \uparrow ;\left[\neg co^{\prime}\right] ; do \downarrow ;[\neg di1 \wedge \neg di2] ; ci^{\prime} \downarrow\right].
$$

As to the implementation of the assignment to $x$, we now observe that, since $C$ and $D$ are active, there is no risk of the assignment to $x$ being started before $C$ is. The interleaving obtained is

$$
\begin{align*}
& *\left[\left[co^{\prime}\right] ; do \uparrow ;[di1 \rightarrow x \uparrow \;\square\; di2 \rightarrow x \downarrow] ;\right. \tag{32}
\end{align*}
$$

which can be factored into the wire

$$
\left(co^{\prime}\; \underline{w}\; do\right) \equiv *\left[\left[co^{\prime}\right] ; do \uparrow ;\left[\neg co^{\prime}\right] ; do \downarrow\right]
$$

and the register

$$
\begin{aligned}
(di1, di2)\; \underline{reg}\;\left(x, ci^{\prime}\right) \equiv\; & *[[di1 \rightarrow x \uparrow ;[x] ; ci^{\prime} \uparrow ;[\neg di1] ; ci^{\prime} \downarrow \\
& \;\;\square\; di2 \rightarrow x \downarrow ;[\neg x] ; ci^{\prime} \uparrow ;[\neg di2] ; ci^{\prime} \downarrow \\
& \;\;]].
\end{aligned}
$$

The implementation of the interface is shown in Figure 19.
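
The behavior of the active-port interface can be explored with a small simulation. The Python sketch below is illustrative only: the production rules are one possible reading of the wire and the register given above (in particular, the guard used for $ci^{\prime} \uparrow$ is an assumption), the environment is idealized, and the names `settle`, `RULES`, and `receive` are hypothetical.

```python
def settle(state, rules):
    """Fire enabled production rules until no rule changes the state.

    Assumes di1 and di2 are never true at the same time; otherwise the
    two rules on x would oscillate and this loop would not terminate.
    """
    changed = True
    while changed:
        changed = False
        for guard, var, value in rules:
            if guard(state) and state[var] != value:
                state[var] = value
                changed = True
    return state


# Assumed production rules: (guard, variable, new value).
RULES = [
    (lambda s: s["co"], "do", True),        # wire: co' sets do
    (lambda s: not s["co"], "do", False),   # wire: not co' resets do
    (lambda s: s["di1"], "x", True),        # register: di1 sets x
    (lambda s: s["di2"], "x", False),       # register: di2 resets x
    # ci' rises once x holds the value announced on the data rails.
    (lambda s: (s["di1"] and s["x"]) or (s["di2"] and not s["x"]), "ci", True),
    (lambda s: not s["di1"] and not s["di2"], "ci", False),
]


def receive(state, bit):
    """One four-phase input of `bit`; returns the observed output changes."""
    rail = "di1" if bit else "di2"
    trace = []
    # Each step: the environment changes one input, then we watch one output.
    for var, value, watched in [("co", True, "do"), (rail, True, "ci"),
                                ("co", False, "do"), (rail, False, "ci")]:
        state[var] = value
        settle(state, RULES)
        trace.append((watched, state[watched]))
    return trace
```

Starting from all variables false, `receive(state, 1)` follows the two-wire interleaving ($do \uparrow$, $ci^{\prime} \uparrow$, $do \downarrow$, $ci^{\prime} \downarrow$) and leaves `state["x"]` true; a subsequent `receive(state, 0)` resets it.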

### 20.3 Output Actions

Figure 18. Input actions on passive port.

In the case of an output, like out!$x$ or put!$x$, the implementation turns out to be the same for passive and active ports. Given the same nomenclature as in the input case, port $D$ is now implemented with two output variables, $do1$ and $do2$, and one input variable, $di$. Port $C^{\prime}$ is not changed. The rest of the derivation is straightforward and is left as an exercise for the reader. It leads to a wire and a read operator, which we have introduced in the implementation of the register:

$$
\begin{aligned}
di\; \underline{w}\; ci^{\prime} & \equiv *\left[[di] ; ci^{\prime} \uparrow ;[\neg di] ; ci^{\prime} \downarrow\right] \\
\left(co^{\prime}, x\right)\; \mathrm{read}\;(do1, do2) & \equiv *[[x \wedge co^{\prime} \rightarrow do1 \uparrow ;\left[\neg co^{\prime}\right] ; do1 \downarrow \\
& \qquad \square\; \neg x \wedge co^{\prime} \rightarrow do2 \uparrow ;\left[\neg co^{\prime}\right] ; do2 \downarrow \\
& \qquad ]].
\end{aligned}
$$

The only difference between the active and the passive cases is that, in the active case, the read is activated first. In the passive case, the wire is activated first. The circuit is shown in Figure 20.
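
As a minimal sketch of what the read operator computes, the function below treats it as a combinational mapping from $(co^{\prime}, x)$ to the dual-rail pair $(do1, do2)$. This view assumes that $x$ is stable while $co^{\prime}$ is high; the function name `read` and its Boolean encoding are illustrative choices, not part of the original design.

```python
def read(co, x):
    """Dual-rail output of the read operator for request co' and register bit x.

    do1 is raised for x true, do2 for x false; both are low when co' is low.
    """
    do1 = x and co
    do2 = (not x) and co
    return do1, do2
```

Exactly one of the two rails is high while $co^{\prime}$ holds, which is what lets the receiver both obtain the value and detect that it has arrived.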

### 20.4 Active Input and Passive Output

A somewhat surprising result of this implementation of input and output commands is that, contrary to common belief, it is simpler to implement input commands with active ports than with passive ports. The gain is quite important: For $n$ bits of data, the active implementation saves $2 \times n$ asymmetric C-elements and $n$ or-gates. On the other hand, the implementation of output actions is the same for active and passive ports.

Figure 19. Input actions on active port.

Therefore, we shall always implement input actions with active ports. When the input port is probed, as in the stack example, we shall use a slightly more complicated implementation of the handshaking protocol that makes it possible to probe an active port.
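
To make the size of the gain concrete, the following sketch simply evaluates the two counts stated above for a given data width. The function name and the dictionary keys are hypothetical.

```python
def active_input_savings(n_bits):
    """Operators saved by an active input port of n_bits, per the counts above."""
    return {
        "asymmetric_c_elements": 2 * n_bits,  # two per data bit
        "or_gates": n_bits,                   # one per data bit
    }
```

For example, `active_input_savings(16)` reports 32 asymmetric C-elements and 16 or-gates saved on a 16-bit input port.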

### 20.5 Lazy-Active Protocol

Consider the active implementation of communication command $X$ :

$$
xo \uparrow ;[xi] ; xo \downarrow ;[\neg xi] .
$$

We introduce an alternative active protocol, called lazy-active:

$$
[\neg xi] ; xo \uparrow ;[xi] ; xo \downarrow .
$$

The lazy-active protocol is derived from the active one by postponing wait action $[\neg xi]$ until the beginning of the next communication on $X$, and by adding a vacuous wait action $[\neg xi]$ at the beginning of the first communication on $X$. Hence, the lazy-active protocol is a correct implementation.
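
The derivation just described can be checked mechanically on event sequences. In the sketch below, each protocol is written as a list of event names (`xo+` and `xo-` for the transitions, `[xi]` and `[~xi]` for the waits); these names and the helper functions are illustrative assumptions.

```python
ACTIVE = ["xo+", "[xi]", "xo-", "[~xi]"]
LAZY_ACTIVE = ["[~xi]", "xo+", "[xi]", "xo-"]


def repeat(protocol, k):
    """Event sequence for k successive communications on X."""
    return protocol * k


def lazy_from_active(k):
    """Apply the transformation to k active communications.

    Each trailing [~xi] is postponed to the start of the next
    communication, and one vacuous [~xi] is added at the very beginning.
    The final [~xi] is still pending, so it is dropped from the result.
    """
    events = repeat(ACTIVE, k)
    return ["[~xi]"] + events[:-1]
```

For any number of communications `k`, `lazy_from_active(k)` equals `repeat(LAZY_ACTIVE, k)`: the two protocols perform the same transitions on the wires in the same order, and differ only in when the process waits for $xi$ to return to false.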

Consider sequence $X$; $S$, where $S$ is an arbitrary program part. With $X$ lazy-active, half of the communication delays overlap with the execution of $S$. The gain is particularly important when data communication is involved, since half of the data-transmission delays and half of the "completion-tree" delays can overlap with the rest of the computation.

Figure 20. Output-action interface.

This important property of lazy-active protocols was discovered recently by Steve Burns. All input actions are now implemented as lazy-active. We have not done so in the stack, which is an older design.

## 21 The Complete Circuit for the Stack

The sharing of register $x$ by ports in and get has to be implemented either by a multiplexer or by a multiport flip-flop. Since only two ports share the register, we choose to use a dual-port flip-flop. The complete datapath is shown in Figure 21.

Figure 21. The complete datapath.

Figure 22. The complete circuit for a one-bit stack element.

The complete circuit obtained by composing the different parts together is shown in Figure 22. An important optimization has been added to the design. It concerns the implementation of the second guard of $E$:

$$
\overline{\text { out }} \rightarrow \text { get?x; out!x. }
$$

We observe that the value of $x$ involved in the second action (out! $x$ ) is the same as the value of $x$ involved in the first action (get?x). We can therefore encode the transmitted value in the handshaking expansion of the guarded command without having to use register $x$. We are tempted to make this optimization available to the programmer by allowing assignments to ports. We would then write

$$
\overline{o u t} \rightarrow \text { out!get. }
$$

The preceding modification leads to a significant simplification of the circuit since we can eliminate a D-element, and, for each bit of the data path, we can eliminate an IF-element and replace the multiport flip-flop with a simple flip-flop. The chip we have fabricated includes this modification, as well as the optimization that consists in making input port in active.

## 22 A Delay-Insensitive Fair Arbiter

This last example addresses the issues of arbitration between guards and unstable guards. We have already discussed the metastability property of arbiters. The realization of a delay-insensitive arbiter, however, raises another issue: fairness. An arbiter is strongly fair when a pending communication request is granted after a bounded number of other requests are granted. An arbiter is weakly fair when a request is granted after a finite but possibly unbounded number of other requests. Whether it is possible to construct a delay-insensitive fair arbiter has been, so far, an open question. It has been conjectured that delay-insensitive fair arbiters do not exist. In this example, we prove the existence of delay-insensitive fair arbiters by constructing one.

### 22.1 A Fair-Arbiter Program

The process fsel described in the first part defines a fair arbitration program between two unrelated inputs. We choose to implement the following simplified version of fsel:

$$
\begin{equation*}
*[[\bar{A} \rightarrow A \;\square\; \neg \bar{A} \rightarrow \text{skip}] ;[\bar{B} \rightarrow B \;\square\; \neg \bar{B} \rightarrow \text{skip}]] \tag{33}
\end{equation*}
$$

According to (33), when $\bar{A}$ holds, $A$ will be completed after at most one $B$ action, regardless of the current state of the computation. Hence, the arbiter is strongly fair towards requests $A$ and $B$. Assume that $A^{\prime}$ is pending at a certain point of the computation. By definition of the probe, $\bar{A}$ is true eventually; i.e., a finite but unbounded number of $B$ actions can be completed between the moment $\mathbf{q} A^{\prime}$ holds and the moment $\bar{A}$ holds. Hence, the arbiter is only weakly fair towards requests $A^{\prime}$ and $B^{\prime}$.

Therefore, with this definition of suspension of an action, we can say that the arbiter is strongly fair towards requests that have reached the arbiter and weakly fair towards all requests. (We could redefine the suspension of a communication action $X$ such that $\mathbf{q} X$ holds only when the initiation of action $X$ can be observed by the other process. With this definition of suspension, we have $\mathbf{q} A^{\prime}=\bar{A}$. The arbiter is then strongly fair towards all requests.)
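
The strong-fairness argument can be illustrated by executing the loop of (33) directly. The sketch below is a sequential model only: probes are modeled as hypothetical functions of the iteration number, each communication completes instantly, and metastability is ignored. The names `fair_arbiter` and `max_run` are illustrative.

```python
def fair_arbiter(probe_a, probe_b, steps):
    """Run `steps` iterations of the loop in (33); return the grant sequence.

    probe_a(t) and probe_b(t) say whether the probe of A (resp. B) holds
    when it is evaluated during iteration t.
    """
    grants = []
    for t in range(steps):
        if probe_a(t):        # first selection: grant A or skip
            grants.append("A")
        if probe_b(t):        # second selection: grant B or skip
            grants.append("B")
    return grants


def max_run(grants, who):
    """Longest run of consecutive grants to `who`."""
    best = run = 0
    for g in grants:
        run = run + 1 if g == who else 0
        best = max(best, run)
    return best
```

With both probes always true, the grants alternate strictly, so no request waits for more than one grant to the other side. With `probe_a = lambda t: t >= 2`, two $B$ grants precede the first $A$ grant, which models the delay between a request being pending and its probe becoming true (the weak-fairness case).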

### 22.2 The Compilation

Applying the process decomposition rule, we decompose (33) into three processes ( $P 1\|P 2\| P 3$ ). Channels ( $C, D$ ) between $P 1$ and $P 2$, and ( $E, F$ ) between $P 1$ and $P 3$ are introduced:

$$
\begin{aligned}
P1 \equiv\; & *[E ; C] \\
P2 \equiv\; & *[[\bar{D} \wedge \bar{B} \rightarrow B ; D \\
& \;\;\square\; \bar{D} \wedge \neg \bar{B} \rightarrow D \\
& \;\;]] \\
P3 \equiv\; & *[[\bar{F} \wedge \bar{A} \rightarrow A ; F \\
& \;\;\square\; \bar{F} \wedge \neg \bar{A} \rightarrow F \\
& \;\;]].
\end{aligned}
$$

Ports $D$ and $F$ are implemented as passive; ports $C$ and $E$ are implemented as active. Hence $P 1$ is the standard active-active buffer. The handshaking expansion of $P 2$ gives

$$
\begin{aligned}
P2 \equiv\; & *[[di \wedge bi \rightarrow bo \uparrow ;[\neg bi] ; bo \downarrow ; do \uparrow ;[\neg di] ; do \downarrow \\
& \;\;\square\; di \wedge \neg bi \rightarrow do \uparrow ;[\neg di] ; do \downarrow \\
& \;\;]].
\end{aligned}
$$

Because $bi$ can change from false to true asynchronously, the second guard of $P2$ is not stable; i.e., its value can change from true to false at any time.

In order to make both guards of P2 stable, we introduce the synchronizer

$$
\begin{aligned}
sync \equiv\; & *[[di \wedge bi \rightarrow u \uparrow ;[\neg di] ; u \downarrow \\
& \;\;\square\; di \wedge \neg bi \rightarrow v \uparrow ;[\neg di] ; v \downarrow \\
& \;\;]].
\end{aligned}
$$

Sync is the standard operator we have described in Part I. We now have to find a process, $X$, such that $(X \| s y n c)=P 2$. Since sync is entirely defined, we would like to be able to perform the inverse operation of $\|$, or "process quotient", so as to compute $X$ as $X=(P 2 \div s y n c)$. A way to perform this quotient is to remove all actions of sync from $P 2$, and then to check whether the result fulfills $(X \| s y n c)=P 2$.

To perform the quotient as suggested, $P 2$ should be extended to contain all actions of sync, so that the orders of actions are compatible in sync and in the extended version of $P 2$. (This procedure is explained in [10].) The extension of $P 2$ gives


11 .

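We obtain for $X$

$$
\begin{aligned}
X \equiv\; & *[[u \rightarrow bo \uparrow ;[\neg bi] ; bo \downarrow ; do \uparrow ;[\neg u] ; do \downarrow \\
& \;\;\square\; v \rightarrow do \uparrow ;[\neg v] ; do \downarrow \\
& \;\;]].
\end{aligned}
$$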

The compilation of the first guarded command is facilitated if transition $bo \downarrow$ is postponed until after $[\neg u]$. This transformation does not introduce deadlock since the completion of $D$ does not depend on the completion of $B$. After this transformation, the PR expansion gives

$$
\begin{array}{rlrl}
u & \mapsto bo \uparrow & \neg u & \mapsto bo \downarrow \\
u \wedge \neg bi & \mapsto do \uparrow & v & \mapsto do \uparrow \\
bi \vee \neg u & \mapsto do \downarrow & \neg v & \mapsto do \downarrow .
\end{array}
$$

The operator reduction, which includes the introduction of auxiliary variables $do^{\prime}$ and $do^{\prime \prime}$, gives

$$
\begin{array}{rll}
u & \underline{w} & bo \\
(u, \neg bi) & \underline{\wedge} & do^{\prime} \\
v & \underline{w} & do^{\prime \prime} \\
\left(do^{\prime}, do^{\prime \prime}\right) & \underline{\vee} & do .
\end{array}
$$

The circuit is shown in Figure 23. The implementation of P3 is identical.
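
The resulting network can be exercised with a short simulation. The sketch below rests on several assumptions: the operator list is read as a wire ($u$ to $bo$), an and-gate ($u$ and $\neg bi$ to $do^{\prime}$), a wire ($v$ to $do^{\prime \prime}$), and an or-gate; the synchronizer is idealized as a component that samples $bi$ when $di$ rises and holds its decision until $di$ falls, with metastability ignored. The names `p2_gates`, `Sync`, and `run` are hypothetical.

```python
def p2_gates(u, v, bi):
    """Combinational part of P2 under the assumed reading; returns (bo, do)."""
    bo = u                    # wire from u to bo
    do1 = u and not bi        # and-gate producing do'
    do2 = v                   # wire from v to do''
    return bo, (do1 or do2)   # or-gate producing do


class Sync:
    """Idealized synchronizer: decides between u and v when di rises."""

    def __init__(self):
        self.u = False
        self.v = False

    def step(self, di, bi):
        if not di:
            self.u = self.v = False          # [not di]; u or v goes down
        elif not (self.u or self.v):
            self.u, self.v = bi, not bi      # sample bi once per request
        return self.u, self.v


def run(sync, inputs):
    """Feed (di, bi) pairs through the synchronizer and gates."""
    outputs = []
    for di, bi in inputs:
        u, v = sync.step(di, bi)
        outputs.append(p2_gates(u, v, bi))
    return outputs
```

With a request on $B$ pending, the inputs `[(True, True), (True, False), (False, False)]` first raise `bo` alone, then raise `do` once $bi$ has been withdrawn, and finally lower both. Without a pending $B$, `[(True, False), (False, False)]` raises and lowers `do` while `bo` stays low.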

### 22.3 The Circuit

The final circuit, shown in Figure 24, is obtained by composing the two identical circuits implementing P2 and P3 with the circuit of P1. The reshuffled version of $P 1$, consisting of a wire and an inverter, can also be used if it can be proved that the reshuffling does not introduce deadlock. The circuit shown in Figure 24 includes a minor optimization that eliminates the negated inputs that are also the output of a fork.

## 23 Conclusion

We have described a method for implementing a concurrent program (a set of communicating processes) as a network of digital operators that can be directly mapped into a delay-insensitive VLSI circuit. The circuit is derived from the program by applying a series of systematic, semantics-preserving transformations that we have compared to compiling. Hence, the circuits are correct by construction, and their logical correctness is independent of the delays in operators and wires, with the exception of isochronic forks.

Figure 23. Implementation of $P2$.

The examples cover most of the constructs of the language but not all of them: We have not shown how to implement an arbitrary set of guards. Therefore, we have not quite shown that any program in the language can be compiled. Such a proof has been given in [1] and [2], where the compilation of each construct is described as part of the basic algorithm for an automatic compiler. It is shown that any program in a subset of the language can be implemented as a delay-insensitive circuit using only a small set of basic elements: the two-input C-element, the two-input or-gate or two-input and-gate, the synchronizer, the inverter, and the isochronic fork.

There is no reason, however, for confining the designer to a minimal set of operators. On the contrary, since an advantage of VLSI is the possibility to create operators at no cost, introducing the special-purpose operator that exactly implements an arbitrary set of production rules often simplifies a circuit drastically.

In order to convince the VLSI community of the practicality of our method, it was essential to fabricate the circuits we had designed. Hence, all significant examples that we have used in our research (distributed mutual exclusion, queues, stacks, routing automata for a communication network, the $3x+1$ engine) have been fabricated in SCMOS using the MOSIS foundry service. They have all been found to be correct on "first silicon". They are also very robust and, given the low level of circuit optimization applied, surprisingly fast. The $3x+1$ engine, constructed by Tony Lee, is a special-purpose processor consisting of a state-machine and an 80-bit-wide datapath. It contains approximately 40,000 transistors and operates at over 8 MIPS (million instructions per second) in $2 \mu \mathrm{~m}$ MOSIS SCMOS technology.

Figure 24. Implementation of the fair arbiter.

At the moment of writing, we have just completed the design of the first asynchronous general-purpose microprocessor [12]. It is a 16-bit RISC-like architecture with independent instruction and data memories. It has 16 registers, four buses, an ALU, and two adders. The size is about 20,000 transistors. Two versions have been fabricated: one in $2 \mu \mathrm{~m}$ MOSIS SCMOS, and one in $1.6 \mu \mathrm{~m}$ MOSIS SCMOS. (On the $2 \mu \mathrm{~m}$ version, only 12 registers were implemented in order to fit the chip on an 84-pin $6600 \mu \mathrm{~m} \times 4600 \mu \mathrm{~m}$ package.)

The chips are entirely delay-insensitive, with the sole exception of the interface with the memories and, of course, the isochronic forks. In the absence of available memories with asynchronous interfaces, we have simulated the completion signal from the memories with an external (off-chip) delay. For testing purposes, the delay on the instruction memory interface is variable.

In spite of the presence of floating $n$-wells, the $2 \mu \mathrm{~m}$ version runs at 12 MIPS. The $1.6 \mu \mathrm{~m}$ version runs at 18 MIPS. (Those performance figures are based on measurements from sequences of ALU instructions without carry. They take no advantage of the overlap between ALU and memory instructions.) Those performances are quite encouraging given that the design is very conservative: no pass-transistors, static gates, dual-rail encoding of data, completion trees, etc.

Only 2 of the 12 $2 \mu \mathrm{~m}$ chips passed all tests, but 34 of the 50 $1.6 \mu \mathrm{~m}$ chips were found to be entirely functional.

We have tested the chips under a wide range of VDD voltage values. At room temperature, the $2 \mu \mathrm{~m}$ version is functional in a voltage range from 7 V down to 1 V! It reaches 15 MIPS at 7 V. We have also tested the chips cooled in liquid nitrogen. The $2 \mu \mathrm{~m}$ version reaches 20 MIPS at 5 V and 30 MIPS at 12 V. The $1.6 \mu \mathrm{~m}$ version reaches 30 MIPS at 5 V. Of course, these measurements are made without adjusting any clocks (there are none), but simply by connecting the processor to a memory containing a test program and observing the rate of instruction execution. The power consumption is 145 mW at 5 V, and 6.7 mW at 2 V.

## 24 Acknowledgments

I am indebted to my students Steve Burns, Dražen Borković, Pieter Hazewindus, Tony Lee, Marcel van der Goot, José Tierno, and Kevin Van Horn for their contributions to the research and their comments on the manuscript. Acknowledgments are also due to Chuck Seitz, Jan van de Snepscheut, Martin Rem, and Huub Schols for numerous discussions on the topic.

## References

[1] Burns, S. M. "Automated compilation of concurrent programs into self-timed circuits". Technical Report CS-TR-88-2, M.S. Thesis, Computer Science Department, California Institute of Technology, 1988.
[2] Burns, S. M. and Martin, A. J. "Syntax-directed translation of concurrent programs into self-timed circuits". Proceedings of the Fifth MIT Conference on Advanced Research in VLSI, J. Allen and F. Leighton, eds., pp. 35-40. MIT Press, Cambridge, Mass., 1988.
[3] Dijkstra, Edsger W. A Discipline of Programming. Prentice-Hall, Englewood Cliffs, N.J., 1976.
[4] Hoare, C.A.R. "Communicating sequential processes". Communications of the ACM 21, 8 (August 1978), pp. 666-677.
[5] Martin, A. J. "The probe: An addition to communication primitives". Information Processing Letters 20 (1985), pp. 125-130.
[6] Martin, A. J. "Compiling communicating processes into delay-insensitive VLSI circuits". Distributed Computing 1, 4 (1986).
[7] Martin, A. J. "A delay-insensitive fair arbiter". Technical Report 5193:TR:85, Computer Science Department, California Institute of Technology, 1985.
[8] Martin, A. J. "FIFO: An exercise in compiling programs into circuits". In From HDL Description to Guaranteed Correct Circuit Design, D. Borrione, ed. North-Holland, Amsterdam, 1986.
[9] Martin, A. J. "A synthesis method for self-timed VLSI circuits". ICCD 87: 1987 IEEE International Conference on Computer Design, pp. 224-229. IEEE Computer Society Press, Los Alamitos, Calif., 1987.
[10] Martin, A. J. "Formal program transformations for VLSI circuit synthesis". In Formal Development of Programs and Proofs, E. W. Dijkstra, ed. Addison-Wesley, Reading, Mass., 1989.
[11] Martin, A. J. "The limitations to delay-insensitivity in asynchronous circuits". Proceedings of the Sixth MIT Conference on Advanced Research in VLSI, W. J. Dally, ed. MIT Press, Cambridge, Mass., 1990.
[12] Martin, A. J., Burns, S. M., Lee, T.K., Borkovic, D., and Hazewindus, P. J. "The design of an asynchronous microprocessor". Decennial Caltech Conference on VLSI, C. L. Seitz, ed., pp. 351-373. MIT Press, Cambridge, Mass., 1989.
[13] May, D. "Compiling occam into silicon". This volume (Chapter 3).
[14] Mead, C. and Conway, L. Introduction to VLSI Systems. Addison-Wesley, Reading, Mass., 1980.
[15] Miller, R. E. Switching Theory, Vol. 2. Wiley, New York, 1965.
[16] Seitz, C. L. "System timing." Introduction to VLSI systems. Chapter 7 of [14].
[17] Snepscheut, J. v. d. Trace Theory and VLSI Design. Lecture Notes in Computer Science, vol. 200. Springer-Verlag, Berlin, 1985.
[18] Weste, N. and Eshraghian, K. Principles of CMOS VLSI Design. Addison-Wesley, Reading, Mass., 1985.


[^0]:    1. We have made a restricted use of shared variables in the design of the microprocessor.
[^1]:    2. This notion of channel is unrelated to the one we introduced for communication among processes.
[^2]:    3. These transformations have not been applied to the circuits presented here as examples, but they are always applied before the circuits are actually implemented.
