# **High Rate Data Synchronization in GALS SoCs**

Rostislav (Reuven) Dobkin<sup>1</sup>, Ran Ginosar<sup>1</sup> and Christos P. Sotiriou<sup>2</sup>

<sup>1</sup>VLSI Systems Research Center, Technion—Israel Institute of Technology, Haifa 32000, Israel

<sup>2</sup>ICS-FORTH, Crete, Greece

rostikd@tx.technion.ac.il

#### **Abstract**

GALS SoCs may be prone to synchronization failures if the delay of their locally-generated clock tree is not considered. This paper presents an in-depth analysis of the problem and proposes a novel solution. The problem is analyzed considering the magnitude of clock tree delays, the cycle times of the GALS module and the complexity of the asynchronous interface controllers using a timed STG approach. In some cases, the problem can be solved by extracting all the delays and verifying whether the system is susceptible to metastability. In other cases, when high data bandwidth is not required, matched-delay asynchronous ports may be employed. A novel architecture for synchronizing inter-modular communications in GALS, based on *locally delayed latching* (LDL), is described. LDL synchronization does not require pausable clocking, is insensitive to clock tree delays, and supports high data rates. It replaces complex global timing constraints with simpler localized ones. Three different LDL ports are presented. The risk of metastability in the synchronizer is analyzed in a technology-independent manner.

#### 1. Introduction

As silicon technology continues to make rapid progress, Systems on Chip (SoCs) incorporate an increasing number of modules of growing sizes, operating at faster clock frequencies. These developments make it ever more difficult to distribute a single synchronous clock to the entire chip [1]. As an alternative, different methods for providing each module with its own clock are being developed. Another motivation for independently clocking different modules is to reduce power consumption by means of dynamic voltage and frequency scaling (DVFS) [2]—[4]. When the clock frequencies of the various modules are uncorrelated with each others, and when they can change over time independent of the clocks of other modules, the resulting SoC is termed a Globally Asynchronous, Locally Synchronous (GALS) system [5][6]. Each GALS module (a "Locally Synchronous Island") can be enclosed in an asynchronous wrapper (Figure 1), which facilitates inter-modular communications and generates the clock for the module [7]-[14].



Figure 1. GALS System [9]

Data synchronization and communication across clock domains in GALS architectures constitute a major challenge. The simple "two-flop" synchronizer typically incurs significant multi-cycle latency and limits the throughput. An alternative solution is provided by elastic FIFO buffers. The most promising approach employs stoppable or stretchable clocks for the GALS modules: Port controllers (Figure 1) can pause the local clock when sampling asynchronous input. By stopping the local GALS clock during data transfers across clock domains, the possibility of metastability is eliminated [7]-[14].

This paper offers two main contributions. First, we demonstrate that GALS systems employing pausable clocks are subject to failures, resulting from delays in their clock distribution networks. We propose modifications to the existing stoppable clock method, which mitigate these problems and yield robust GALS circuits. The modifications require post-layout verification of certain constraints on the clock network delays, to assure safe GALS clocking.

Second, we describe a novel synchronization technique for GALS SoC, *Locally Delayed Latching* (LDL). It does not require pausable clocking, it is insensitive to clock tree delays, and it provides high data rate synchronization. We also propose a performance enhancement of LDL over its previous version [15]. Detailed reliability analysis for LDL is presented. In addition, we present a number of possible GALS wrappers based on LDL.

The paper begins with a survey of related research, in Sect. 2. Standard GALS clocking and synchronization are reviewed and analyzed in Sect. 3. Their potential for failure and exact failure conditions of the GALS approach are demonstrated in Sect. 4. In Section 5, modifications to the conventional GALS approach are presented that mitigate failures. In Sect. 6 we present LDL synchronization and analyze its failure probability in a technology independent manner. LDL simulation results are discussed in Sect. 7.

### 2. Related Work

Two principal clocking and synchronization methods have been proposed for solution of the data synchronization problem. Clock synchronization employs handshake clocks that are stopped based on inputs from other domains [7]. Stoppable local clocks have been proposed in [8]-[13]. According to that methodology, a local ring-oscillator clock generator in each synchronous "island" incorporates a set of MUTEXes [16] that stop the clock temporarily when new input data arrive, so as to avoid the risk of metastability. We analyze this methodology in detail in Sect. 3.

Stoppable clocks have been introduced for GALS system in [12][13]. Asynchronous inter-modular communication is decoupled from the stoppable clock interface of the synchronous modules. A modified two-phase asynchronous interface wrapper for communication between two locally synchronous modules is presented in [8]. The authors also propose FIFO buffering for performance enhancement. A four-phase version of the asynchronous GALS wrapper, which handles multiple ports and also facilitates testing, is presented in [9]. A number of GALS interconnect structures and modified wrappers are analyzed in [10], focusing on ring topology and packet based communications. An architecture for combining synchronous and asynchronous modules in a GALS system is presented in [11], employing handshake based on matched delays. FSM-based demand and poll port controllers are also presented.

A stoppable clock technique for GALS pipelines [14], which does not employ MUTEX arbiters, accounts for clock tree delays by means of delay matching and relies on accurate timing analysis of the clock tree. This solution is only suitable for linear pipelines and does not generalize to arbitrary GALS SoC communications.

In [17], a mixed-timing FIFO was proposed for communication between arbitrary combinations of synchronous and asynchronous domains. Mixed timing relay stations were also introduced for more efficient treatment of long interconnects. Source-synchronous communication, based on a self-timed single-stage FIFO with a single stage for mesochronous clock domains was presented in [18] and expanded to multi-synchronous, plesiochronous and asynchronous cases in [19]. The extensions are more complex relative to the mesochronous case, requiring additional special treatment at the transmitter and receiver sides.

In [20] abstract timing diagrams are used for analyzing synchronization interfaces, instead of the Signal Transition Graph (STG) analysis [21] employed in [15]. In addition, [20][22] propose a new interfacing scheme, with a pausable clock generator at the transmitter side and free-running clock and partial handshake on the receiver side. Unfortunately, the authors report in [20] that without a biased mutex (a mutex that prioritizes one of the inputs, which is in practice physically unrealizable) the scheme may lead to "erroneous communication and loss of messages" when the communicating modules are not perfectly synchronized (in terms of data throughput). [20] proposes a FIFO to partially solve that problem, whereby the clock is paused on the transmitter side when the FIFO is full. A detailed circuit at the transistor-level is presented in [22]. The circuit is reported to have an unknown non-zero failure probability.

### 3. Synchronization in Locally-Clocked GALS SoC

Clocking and input synchronization circuits for locally-clocked SoC proposed in the literature [8]-[13] are mostly variations of the circuit in Figure 2. A locally generated pausable clock is employed in each Locally Synchronous Island. Input and output to other islands are controlled through asynchronous handshake on special ports. In this section we analyze the operation of the circuit in Figure 2.

The wrapper circuit operates as follows. Data arrival is indicated by REQ signal. In response, the port produces signal R, asking the local clock generator for a clock pause. Once the clock is paused, AK is asserted, enabling data

latching by the port latch. After the data is latched, the port de-asserts R, enabling the clock to the Locally Synchronous Island in general and to REG in particular. Since the data (D) is stable by that time, it is assured that REG samples D correctly. At this point, ACK can be asserted, releasing the input handshake (REQ, ACK). Alternatively, ACK can be asserted right after data latching.



Figure 2. Stoppable clock generation [8]

The clock generator [23] comprises a ring oscillator (consisting of the adjustable delay line, the NOR gate and the C-element) and an arbitration circuit (Figure 2). The arbitration circuit employs mutual exclusion elements (MUTEX, Figure 3) [16]. The MUTEX grants  $G_i$  in response to request  $R_i$ , and guarantees that even if both requests arrive simultaneously, only one of them is granted. This is achieved by means of the metastability filter appended to the latch in the MUTEX. Five different signals related to the clock are shown in Figure 2: A, B, X, Y and Z. Each incoming request signal (REQ) is presented to the MUTEX (R), asking for a clock pause. The MUTEX decides whether to grant the request (AK) or to allow the next clock pulse. The next clock pulse will take place only if all MUTEXes allow it, *i.e.* node B goes high. Let's denote the rising and falling transitions of a node n by n+ and n-, respectively. The C-element allows for the local clock signal X to be stretched (X+ transition is blocked) whenever any of the incoming R requests is granted (B is low) when A is rising. X+ will remain blocked until all granted R requests are released (and B eventually goes high).



Figure 3: A mutual exclusion element (MUTEX)

The timing diagram in Figure 4 illustrates the stoppable clock generation process. R+ is enabled in the MUTEX only when signal Z is low. The clock cycle is stretched when R+ arrives during a stretch window,  $\alpha$ , towards the end of the low phase of signal Z. If R+ arrives outside the stretch window, port handshake may complete on time (B+ precedes A+), causing no stretch.



Figure 4. Waveforms of stoppable clock generation

The stretching process can also be described formally with a timed STG (Figure 5), where the arcs are numbered for identification, and the labels on the arcs indicate symbolic transition delays. The STG is a special type of a Petri Net [21]. Tokens are marked by solid circles and their position (marking) determine the circuit state; the token marking in Figure 5, denotes the initial state. Change of state is denoted by moving tokens along directed edges. A transition of node n is enabled when every incoming arc holds a token. When the transition takes place (node n "fires"), all incoming tokens are consumed and new tokens are produced on each outgoing arc. Places (marked by open circles) hold tokens in transit. It is assumed that every arc has a place for holding a passing token, but places are eliminated from the figure when there is no ambiguity. Place p1 is a choice place: The token can exit on either

arc (13 or 14) but not on both, representing the free (random) choice made by the MUTEX in case of contention between Z+ and R+. Place p2 is a merge: It merges tokens arriving on either arc (19 or 20, depending on the previous choice) into arc (21).

The symbolic transition delays ( $D_{MA}$ ,  $D_{CE}$ ,  $D_L$ ,  $D_{NOR}$ ,  $D_{AC}$ ) are defined in Figure 4. The dashed arc labeled  $\delta$  designates the delay from Z- to R+. The timing of R+ may make the  $\delta$  arc part of the critical path in the circuit, stretching the clock.  $\Delta_{CLK}$  denotes clock tree delay from X+ to Y+, and  $\xi$  denotes the delay from incoming R+ until the data latching event D.



Figure 5. Timed STG of the local stoppable clock of Figure 2

The stretch-length (STRETCH in Figure 4) can assume either a deterministic or non-deterministic value. STRETCH is deterministic when there is no contention between R and Z signals at any MUTEX input. In case of contention, the MUTEX incurs an additional non-deterministic delay,  $T_{m/s}^{MUTEX}$ , causing the STRETCH to become non-deterministic too. The contention happens when R and Z both rise within a danger window, W, typically three to four gate-delays long.

Let  $\delta' \in [0, T)$  be the time from Z+ to R+. Since R+ is ignored when Z=1, we define  $\delta$  as the effective time from Z- to a port request, as follows:

$$\delta = \begin{cases} \delta' + \frac{T}{2}, & 0 \le \delta' \le W \\ 0, & W < \delta' \le \frac{T}{2} \\ \delta' - \frac{T}{2}, & \frac{T}{2} < \delta' < T \end{cases}$$
 (1)

Note that  $0 \le \delta < T/2+W$ . From Figure 5 it can be observed that a stretch occurs if the path  $6 \to 23 \to 15 \to 16$  $\to 17 \to 19 \to 21 \to 13 \to 4$  takes longer than a clock cycle:

$$\delta + D_{AC} + D_{MA} + D_{CE} + D_{NOR} > T \tag{2}$$

Note that in case of contention at a MUTEX input,  $D_{AC}$  takes  $T_{m/s}^{MUTEX}$  longer than in a non-contending case (arc 15 timing is extended by additional  $T_{m/s}^{MUTEX}$  delay, see Figure 5).

Stated otherwise, we have two sets of stretch conditions:

$$T - (D_{AC} + D_{MA} + D_{CE} + D_{NOR}) < \delta < \frac{T}{2} - W$$
 (3)

$$\frac{T}{2} - W \le \delta < \frac{T}{2} + W \tag{4}$$

When inequality (3) holds, the stretch is deterministic. In the other case (inequality (4)) it is non-deterministic. Subtracting the lower bound of (3) from the upper bound of (4), we obtain  $\alpha$ , the size of the stretch window:

$$\alpha = (D_{AC} + D_{MA} + D_{CE} + D_{NOR}) + W - \frac{T}{2}$$
(5)

The stretch length, STRETCH, is the difference between the two sides of inequality (2):

$$STRETCH = (\delta + D_{AC} + D_{MA} + D_{CE} + D_{NOR}) - T$$

$$(6)$$

Note that in case of contention at a MUTEX input the stretch is extended by the delay of metastability resolution  $T_{m/s}^{MUTEX}$  (and the delay  $D_{AC}$  becomes non-deterministic). Combining equations (5) and (6), and separating the contention case from the non-contending case we get the following expression for the stretch:

$$STRETCH = \begin{cases} \delta + \alpha - W - T/2, & T - (D_{AC} + D_{MA} + D_{CE} + D_{NOR}) \le \delta < T/2 - W \\ \delta + \alpha - W - T/2 + T_{m/s}^{MUTEX}, & T/2 - W \le \delta \le T/2 + W \end{cases}$$
(7)

If the clock cycle is relatively long, inequality (3) becomes infeasible. However, contention is still possible if R+ happens within  $\pm W$  of Z+.

Fortunately,  $T_{m/s}^{MUTEX}$  can be bounded for any practical application as explained in Sect. 6.2 below. Therefore, when a relatively long clock cycle is employed, the stretched probability becomes insignificant.

Arc 23 represents only the unknown delay  $\delta$  between the clock and the asynchronous request. It does not represent any causal relationship between them, and it is not an essential arc in the STG. Indeed, infinitely many tokens may accumulate on arc 23, as a result of a free running clock when no REQ arrives. All other arcs can accommodate at most one token at a time.

### 4. Synchronization Failures in a Locally-Clocked GALS SoC

The approach described in Sect. 3 disregards the delay  $\Delta_{CLK}$  along the clock tree (from node X to Y), thus potentially causing metastability events in the sampling register, REG, of the Locally Synchronous Island (Figure 2). A failure scenario is detailed in Sect. 4.1 and analyzed in Section 4.2.

#### 4.1. A Clock Delay Failure

A failure caused by the clock tree delay is depicted in Figure 6. Let's assume that a request comes with delay  $\delta$  after Z– and is granted by the MUTEX. Being uncorrelated with the input handshake, the delayed Clock Y may rise simultaneously with the asynchronous data latching in the Port. This conflict can cause metastability in the input REG of the Synchronous Island.  $\xi$  in Figure 6 denotes the circuit delay from R+ to latching data in the port (shown also in STG of Figure 5). In addition to the internal asynchronous port delay,  $\xi$  comprises the MUTEX delay, which may become larger in case of concurrent R+ and Z+ as explained in the previous section. Note that, even though Figure 6 presents the conflict during a stretched cycle, the conflict may happen also when no stretch of the clock occurs, since the events AK+ and Y+ are uncorrelated.

In addition to the metastability problem, this approach suffers from two other drawbacks, *i.e.* pausing the local clock slows down the entire Synchronous Island, and the slowdown may be exacerbated with multi-port GALS modules, where the probability of pausing the clock is higher. Slowing down those synchronous islands which are critical to system performance may slow down the entire system.



Figure 6. Conflict example

#### 4.2. Conflict Analysis

In this section we analyze the conditions required for a conflict event. Starting from X+, the conflict occurs when:

$$\Delta_{CLK} \approx \delta + \xi + D_{NOR} \tag{8}$$

Namely, when the delay along arcs  $6 \rightarrow 23 \rightarrow 15 \rightarrow 16$  on the STG matches the delay along arc 5 in Figure 5. More precisely, the conflict occurs when Y+ happens inside a "danger window" 2W (setup+hold time) around  $\delta + \xi + D_{NOR} + k \cdot T$ , where k is an integer (k>0 accounts for clock delays longer than T). The  $\delta$  value is unknown, but the probability of conflict grows with the number of GALS module ports. In addition, the non-deterministic delay  $\xi$  can be bounded for any practical implementation as explained in Sect. 6.2. Figure 7 emphasizes graphically the combinations of  $\Delta_{CLK}$  and  $\delta$  that lead to conflicts. The graph represents the timing relationships of equation (8). Note that for some values of  $\Delta_{CLK}$ , independent of  $\delta$ , the probability of conflict is negligible ("safe" regions S in Figure 7). Alternative solutions that avoid such conflicts are described in Sect. 5.



Figure 7. Conflict zones

Clock tree delay is a function of the clock tree depth and depends on both technology and architecture. In traditional synchronous circuit design the delay of the clock tree is immaterial, as the clock is constantly running, and only the skew is important. Clock tree delay is proportional to the number of sequential elements driven by the clock and to the tolerable skew. Clock skew balancing becomes increasingly difficult for high-performance large SoC designs, incurring higher clock tree delays. For instance, in a 0.18 $\mu$ m technology, a typical clock frequency achievable with standard EDA tools and standard libraries is 100–500MHz (T=2–10ns), while typical clock delays are 1–2ns, depending on module size (some examples are presented in Table 1). Large SoCs, with tens of modules, may require much longer clock delays, approaching T. With faster technologies and larger chips,  $\Delta_{CLK} > T$  may become common if a single global synchronous clock is attempted for the entire SoC. Thus, while  $\delta \in [0, T/2+W)$ , the range of the clock tree delay is not limited by T.

Table 1. Clock tree delays – implementation examples, 0.18µm technology

| Design                             | Clock Frequency | Clock Cycle, T | Clock Skew | Clock Tree Delay    |
|------------------------------------|-----------------|----------------|------------|---------------------|
| DLX-SYNC (FF)                      | 278 MHz         | 3.60 ns        | 60 ps      | 0.530 ns (15% of T) |
| DES (FF)                           | 540 MHz         | 1.85 ns        | 180 ps     | 1.168 ns (64% of T) |
| AES (OpenCores)                    | 350 MHz         | 2.85 ns        | 165 ps     | 1.111 ns (39% of T) |
| MEM Control (OpenCores)            | 200 MHz         | 5.00 ns        | 126 ps     | 1.014 ns (20% of T) |
| Dual Clock MEM Control (OpenCores) | 200/100 MHz     | 5.00 ns        | 137 ps     | 1.016 ns (20% of T) |

# 5. Metastability-Free GALS Clocking

In this section we present metastability-free circuits for robust data synchronization of GALS modules with stoppable clocking. In Section 5.1 we propose an approach for verifying the correctness of the original circuit of Figure 2. The approach is based on delays extracted from the layout of the original circuit. Section 5.2 proposes a modification of Figure 2 that avoids metastability at the expense of performance. In Section 6 we introduce novel LDL approach for data synchronization.

#### 5.1. Timed Clock Trees

In this section we show how to verify whether the circuit of Figure 2 performs correctly in a given chip. As we show in Figure 7, negligible failure probability is expected for the values of  $\Delta_{CLK}$  that fall inside safe regions S. The size and position of the safe regions depend on a variety of parameters, such as the clock cycle length, design library cell delays (e.g. NOR-gate delay) and asynchronous port delay. It may be possible to verify that a conflict probability is negligible by performing timing analysis of the physical design and verifying that  $\Delta_{CLK}$  falls only inside the safe regions.

For example, let's consider the case of a limited delay clock tree, when  $\Delta_{CLK} \leq T$ . We verify that either:

$$\Delta_{CLK} < D_{NOR} + \xi - T_H \tag{9}$$

Or

$$\Delta_{CLK} > D_{NOR} + \frac{T}{2} + \xi + T_{SU} \tag{10}$$

 $T_{SU}$  and  $T_H$  are the set-up and hold times of the DFF, respectively. When either rule holds, Y+ will occur only inside safe region S (Figure 7 and Figure 8). The first bound results from the  $\delta$ =0 case and the second bound from the  $\delta$ =T/2+W case. Both cases relate to the 6  $\rightarrow$  23  $\rightarrow$  15  $\rightarrow$  16 path in Figure 5.



Figure 8. Conflict and safe zones (example)

Similarly, when  $\Delta_{CLK}$ >T, the port access is allowed only during the S intervals (Figure 7 and Figure 8). In this case either

$$\Delta_{CLK} < D_{NOR} + \xi - T_H \tag{11}$$

Or the following two equalities must hold:

$$D_{NOR} + \frac{T}{2} + \xi + T_{SU} + k \cdot T < \Delta_{CLK}$$

$$D_{NOR} + \xi - T_H + (k+1) \cdot T > \Delta_{CLK}$$
(12)

Where k=0,1,2,...

This solution has several drawbacks. The correctness constraints must be verified after each layout iteration of the circuit. The solution is not scalable and it may be sensitive to thermal and power supply voltage variations (different changes in  $\xi$ ,  $T_{SU}$ ,  $T_H$  and  $D_{NOR}$ ).

## 5.2. Matched Delay Port Control

An alternative solution to fulfilling the clock tree depth constraints presented above is to insert a delay line into the circuit of Figure 2, thus matching the clock-tree delay  $\Delta_{CLK}$ , as shown in Figure 9. By delaying the handshake, it can be guaranteed that it will always happen after the clock has been stopped.



Figure 9. Stoppable clock generation with matched clock-tree delays



Figure~10.~Matched~delay~port~control-wave~diagram

However, the use of this matched delay may cause longer clock stretching, as demonstrated in Figure 10, where in the worst case the stretch is additionally expanded by  $\Delta_{CLK}$ . Note that the stretch window  $\alpha$  is also expanded to  $\alpha$ ' (up to T/2+W):

$$\alpha' = \begin{cases} \alpha + \Delta_{CLK}, & \Delta_{CLK} \le \frac{T}{2} + W - \alpha \\ \frac{T}{2} + W, & \Delta_{CLK} > \frac{T}{2} + W - \alpha \end{cases}$$
(13)

During the design process, it must be verified that the matched delay always exceeds  $\Delta_{CLK}$ , over all possible PVT variations (namely, all corners and all in-die process variations). In addition, this type of solution is not viable for designs with high clock rates, which often imply long clock tree delays. In such designs, clock stretch may happen on each handshake, since in this case it is very likely that  $\alpha \rightarrow T/2$ . In addition, this approach presents similar drawbacks as the "constrained delay clock tree" above.

# 6. Locally Delayed Latching (LDL) Synchronization

In this section we introduce the Locally-Delayed Latching (LDL) approach, which allows for GALS intermodular communication and synchronization without the need for an arbitrated clock. It is shown that the LDL approach replaces the constraints on the clock delay (which were discussed in Section 5) by simpler and more localized timing constraints, which are easier to achieve and verify. The LDL concept is described in Sect. 6.1, and the LDL trade-off between reliability and data rate is analyzed in Sect. 6.2. LDL constraints are detailed in Sect. 6.3. A further enhancement of LDL is described in Sect. 6.4. and expected performance is discussed in Sect. 6.5. Finally, in sections 6.6-6.8 we provide three implementation examples of input and output ports. Sect. 7 presents simulations of these implementations.

#### 6.1. LDL Principles

In LDL input port synchronizer, the asynchronous controller (Figure 11) controls both the input latch and Y1, the clock input to the first sampling register. Signal Y, the local clock of the module, is uninterrupted. In addition, the port issues a valid indication for each newly received data word and prevents write after read (WAR) hazards. Various modes of LDL operation are demonstrated in Figure 12.

In LDL, the clock of the locally synchronous island is never stopped. The only measure available is to delay Y1+ when a conflict is imminent. Y1- is unaffected, and only the high-phase is shortened. A port request is accepted only during the low-phase of Y, latching the incoming data (L+) and delaying Y1+ when needed. The conflicts between Y+ and REQ+ are resolved by a MUTEX inside the control. A number of such asynchronous controllers for generating L and Y1 are presented in sections 6.6—6.8.

LDL is unaffected by clock cycle changes that can be caused for instance due to dynamic frequency or voltage scaling [2]-[4]. There is also no restriction on stopping the clock during periods of inactivity.



Figure 11. Locally delayed latching (LDL) circuit



Figure 12. Locally delayed latching operating modes

# 6.2. LDL Synchronization Reliability and Performance

The worst case operation occurs when at the conflict between REQ+ and Y+, REQ+ wins. In this case the highphase of Y1 is maximally shortened, shown in Figure 12, in the "Port Wins" case. The shorter cycle leaves less time for computing in the combinational logic immediately following the first register. The implementation must assure that the remaining high phase is long enough, according to the restrictions on the minimal high phase width for FFs or registers of the target library. The high phase of the clock is shortened by an amount equal to the latency of the asynchronous control ( $D_{CTRL}$ ) and the MUTEX resolution latency. To analyze this further, we define the following:

*Resolution*: A metastable MUTEX *resolves* when the value stored in its internal latch is set non-deterministically to either 1 or 0, and all combinational functions of that value (MUTEX outputs) have been evaluated. The resolution latency is indeterminate and unbounded.

Failure: A circuit is said to fail if a combinational function of the output of a metastable MUTEX of that circuit does not resolve within a pre-defined maximal time,  $T_{m/s}^{MUTEX}$ .

Safety: A circuit is M-safe if the expected time between two successive failures exceeds M (M is also known as mean time between failures, MTBF) [24].

Min High Clock Phase:  $T_{HP}^{Min}$  is a minimally allowed clock high-phase time for a FF (typically about three FO4 inverter gate delays).

We require that the SoC be at least M-safe, where a selected value for M could be 100 years (other values may also be used). To achieve that, the safety of each synchronizer in a SoC with about K=100 synchronizers must be at least K times larger, namely M=10,000 years [25]. We note that, in a standard SoC (a digital IC based on standard cells and designed using standard EDA tools) the shortest clock cycle is typically about 100—160 FO4 inverter delays [26]. The nominal FO4 inverter delay depends mostly on the process technology (as detailed in Sect. 6.5). Thus, the fastest high phase (50% of the clock cycle) is about 50 inverter delays long. In order to assess the worst case MTBF in the following equation, we assume that  $\tau$  and W are one and two FO4 inverter delays, respectively [24],  $F_D = F_C$  (worst case analysis), the clock cycle  $T = 100\tau$ , and thus  $F_C = F_D = 1/100\tau$ . We can determine the required metastability resolution time  $T_{m/s}^{MUTEX}$  in terms of a number of gate delays N, by solving:

$$MTBF = \frac{e^{\frac{T_{MUSEX}^{MUTEX}}{\tau}}}{W \cdot F_C \cdot F_D} = \frac{e^N}{2 \times \frac{1}{100} \times \frac{1}{100}} \times \tau \ge 10^4 \text{ years}$$
(14)

For  $10^{-11} < \tau < 10^{-10}$  sec (the range of FO4 gate delays in present and foreseeable technologies, cf. Sect. 6.5):

$$41 < N \approx \ln(\frac{10^7}{\tau}) < 43 \tag{15}$$

For T=100 $\tau$ , this implies that at least one half of a symmetric clock cycle should be allowed for resolution. For slower SoCs, e.g. where the fastest clock cycle is 160 $\tau$  [26], a quarter clock cycle suffices to achieve this MTBF. For most aggressive designs (such as high-speed processors or high speed ASIC modules) where  $10\tau < T < 50\tau$ , a different approach based on multi-cycle resolution time or on multi-synchronous clocking [27] is required.

#### 6.3. LDL Constraints

As explained in Sect. 6.2, to guarantee minimal high clock phase  $T_{H\!P}^{\mathit{Min}}$  , we require that

$$\frac{T}{2} - D_{CTRL} - T_{m/s}^{MUTEX} > T_{HP}^{Min} \tag{16}$$

Leading to a constraint on the asynchronous control delay:

$$D_{CTRL} < \frac{T}{2} - T_{HP}^{Min} - T_{m/s}^{MUTEX} \tag{17}$$

Another constraint applies to  $D_L$ , the delay of the combinational logic that follows REG1. When the rising edge of Y1 is delayed (by up to  $D_{CTRL} + T_{m/s}^{MUTEX}$ ), the effective computation time in that logic stage becomes shorter than the clock cycle. Therefore, the following should be satisfied:

$$D_L < T - D_{CTRL} - T_{m/s}^{MUTEX} \tag{18}$$

 $D_{CTRL}$  contains additional buffering delays when wide data path is required. These constraints are verified in Sect. 7 for the implementations in Sections 6.6-6.8.

#### 6.4. LDL Performance Enhancement

When a clock faster than T=160 $\tau$  is employed, MTBF requirements prohibit shortening the high phase of Y1. To circumvent that obstacle, a Minimum Phase Generator (MPG) is employed as in Figure 13 to guarantee that the high Y' clock phase is no shorter than  $T_{m/s}^{MUTEX} + T_{HP}^{Min} + D_{CTRL}$  (see Eq. (16)). For fast clocks, this minimum is longer than half a clock cycle. An example of a 75% duty cycle Y' clock is shown in Figure 14. In this case, Y1 is also 75% duty cycle when there is no conflict. In time of conflict, there is sufficient margin in the high phase of Y' (at least 43 FO4 gate delays as in Eq. (15)), and Y1 is guaranteed to be no shorter than  $T_{HP}^{Min}$ . The request is treated during the low-phase period of Y'.



Figure 13: Locally delayed latching circuit with MPG for an extended high phase



Figure 14: Extended high phase for high frequency operation. After REQ—Y' conflict, REQ wins.

# 6.5. LDL Performance vs. Technology

Considering minimal low-phase and high-phase width of about three gate-delays each, asynchronous controller latency of about 20 gate-delays, and preserving 43 gate delays for metastability resolution (Eq. (15)), the minimal clock cycle is 69 gate-delays. Without the MPG enhancement of Sect. 6.4, when 50% duty cycle clock is employed, the minimal clock half phase requires 3+20+43=66 FO4 gate delays, namely the minimal clock cycle is 132 gate delays. The two alternative circuits enable clock frequencies up to the bounds shown in Table 2. The table is based on scaled FO4 inverter delays, derived from the FO3 NAND delay model of the ITRS [26]. Similar results were reported also in [28][29]. This table applies to ASIC, and is inapplicable to high speed microprocessors, where the cycle time is typically around 20 gate delays.

Table 2. Maximal clock frequencies achievable with locally delayed latching

| Technology | FO4 Gate Delay* | Maximal Rate (No MPG, Figure 11) | Maximal Rate (With MPG, Figure 13) |
|------------|-----------------|----------------------------------|------------------------------------|
| 350 nm     | 95 ps           | 79 MHz                           | 153 MHz                            |
| 250 nm     | 67 ps           | 112 MHz                          | 216 MHz                            |
| 180 nm     | 47 ps           | 160 MHz                          | 308 MHz                            |
| 130 nm     | 33 ps           | 228 MHz                          | 439 MHz                            |
| 90 nm      | 22 ps           | 342 MHz                          | 658 MHz                            |
| 65 nm      | 15 ps           | 505 MHz                          | 966 MHz                            |

<sup>\*</sup>Delays are based on High Performance processes [26]

## 6.6. Decoupled Input Port

Figure 15 shows a possible implementation of Figure 11. Without a conflict, Y1+ is either not delayed or delayed by less than  $D_{CTRL}$ . R2'+ is granted only during the low-phase of Y. The MUTEX arbitrates any conflict between R2'+ and Y+. When R2'+ wins over Y+, the asynchronous controller is granted (R3+). The controller employs an asymmetric matched delay  $Do \rightarrow Di$  to open the latch and then close it again  $(L+\rightarrow L-)$ . The asymmetric delay applies a longer delay to one of the two edges. In our case, the delay is minimal for the falling edge and is longer for the rising one. The long delay matches the latch propagation delay for all corners and delay variations. After R2'-, Y1+ triggers REG1, leading to a shortened cycle in the combinational logic following REG1 (the cycle is shortened by  $D_{CTRL}$ ). If the clock wins over R2'+, R3+ happens only half a cycle later, after Y-.



Figure 15. GALS module decoupled input port

The STG of the Decoupled Input Port control is shown in Figure 16. The controller delay is measured along the dashed path. The path is contained entirely inside the input port, ensuring that any reduction of the clock cycle depends solely on the input port control logic (and not on the logic and clock of the transmitter module). The MUTEX output Y1 should be buffered with a low-skew net when wide data path is required. In this case, the additional latency must be taken into account. The latency of the controller is verified in Sect. 7.

Generation of the Valid signal is performed as follows. For each granted data transfer the control issues signal L that latches the data and concurrently sets Valid to logic high, indicating a new ready data word inside the data latch. Once latching is accomplished, the controller de-asserts R2 (R2-) and along with Valid+ releases the MUTEX. The next incoming data transfer request (R2+) is blocked (by means of the c-element) until the data is sampled by REG1. Valid is released with the next rising edge of Y1, enabling also arbitration of the next incoming data transfer request. The SR latch of the Valid signal is free of conflicts: L and Y1 are guaranteed to be mutually exclusive thanks to the MUTEX.



Figure 16. GALS decoupled input port asynchronous control STG

# 6.7. Decoupled Output Port

Output ports of GALS modules are subject to the same issues as the input ports which were discussed in the previous sections. The difference is that with an output port it is the incoming ACK signal which must be synchronized. An output port circuit is shown in Figure 17 and the STG of its control is shown in Figure 18. The internal acknowledge (A1) is decoupled from the external asynchronous handshake.

The controller latency of the Decoupled Output Port control,  $D_{CTRL}^{OUT}$ , should be verified according to Eqs. (17) and (18). In this case:

$$D_{CTRL}^{OUT} = D[A4+ \rightarrow A1+ \rightarrow A3-] \tag{19}$$

The latency of the controller is verified in Sect. 7.



Figure 17. GALS module decoupled output port



Figure 18. GALS module decoupled output port asynchronous control STG

## 6.8. A Simpler Input Port

The architecture in Figure 19 simplifies the input port of Sect. 6.6, eliminating the asynchronous controller from the Input Port.

The input port delay  $D_{CTRL}$  now depends on the external delays of the output port:

$$D_{CTRL} = D_{LATCH} + D_{OutputPort}(ACK+ \rightarrow REQ-)$$
(20)

The matched delay in Figure 19 could have been reduced from  $D_{LATCH}$  to  $D_{LATCH}$ — $D_{OutputPort}$ , but when  $D_{OutputPort}$  is unknown a-priori, it is better to leave the matched delay at  $D_{LATCH}$ . This simple input port is compatible with the output port of Sect. 6.7. The latency of this constellation (Simple Input Port with the Decoupled Output Port) is also verified in Sect. 7.



Figure 19. Simple input port

### 7. Simulations

The circuits of sections 6.6-6.8 were synthesized using Petrify [30], converted to VHDL, synthesized by the Synopsis Design Compiler using 0.35µm and 0.25µm CMOS libraries [31][32], and verified by gate level simulations with wire-load model delays (SDF). Table 3 lists the results for the three controllers. We employed 16 bit wide data buses and clock cycles of 160 FO4 gate delays in a standard ASIC [26].

According to Eq. (17), while one-quarter cycle is preserved for metastability resolution, we are left with another quarter cycle (40 inverter delays) for the delay of the asynchronous controller and for the clock high-phase. Our  $0.35\mu m$  library specifies  $T_{HP}^{Min}=0.361\,ns$ , namely about 3 inverter delays, leaving 37 gate delays for the asynchronous controller delay,  $D_{CTRL}$ . Thus, we should verify that the delay of the circuits described in sections 6.6-6.8 is less than 37 gate delays. The circuit delays are listed in Table 3. All of them fulfill the delay requirement.

Table 3. Controller delays

| Circuit                                      | Critical Path                                                    | Latency (0.35μm) | Latency (0.25μm) | Num. of<br>FO4<br>inverter<br>delays | Estimated portion of 160τ clock cycle |
|----------------------------------------------|------------------------------------------------------------------|------------------|------------------|--------------------------------------|---------------------------------------|
| Decoupled Input Port                         | $R3+\rightarrow D0+\rightarrow Di+\rightarrow L-\rightarrow R2-$ | 2.63 ns          | 2.1 ns           | 21                                   | 13 %                                  |
| Decoupled Output Port                        | $A4+\rightarrow A1+\rightarrow A3-$                              | 1.81 ns          | 1.22 ns          | 14                                   | 9 %                                   |
| Simple Input Port with Decoupled Output Port | Latch Delay→A2+→R2-                                              | 2.14 ns          | 1.53 ns          | 16                                   | 10 %                                  |

 $(Note: Unlike\ Table\ 2, our\ 0.35\mu m\ and\ 0.25\mu m\ library\ cells\ exhibit\ 130ps\ and\ 95ps\ FO4\ gate\ delays,\ respectively)$ 

According to *Table 3*, the delays of all three asynchronous controllers are lower than the bound of 37 FO4 gate delays, requiring roughly ten percents of 160 FO4 gate delays clock cycle. This margin allows operating at a slightly higher frequency, if needed. Evidently, this approach is limited by the time we reserve for the MUTEX to resolve. However, it provides a useful operational frequency range for most ASICs.

To preserve timing correctness, careful layout should be performed. The sampling latch, the first register REG1, the asynchronous control and the MUTEX must be placed closely together in order to avoid the impact of wire propagation delay on the critical path. These requirements are expected to be met easily, since the wrapper contains only a single port and is not connected to any other parts of the module.

The overhead of the LDL controller is expected to be less than 100 gates (including the MPG). For example, the decoupled input port controller logic complexity is equivalent to 36 two-input NAND gates, and the MPG requires about 25 gates. For a typical SoC module of 100 KGates, the LDL controller overhead is only 0.1%. Another 0.1% overhead may be incurred by the latches of the input port.

#### 8. Conclusions

We have addressed the problem of synchronization failures due to clock delays in locally generated, arbitrated clocks of GALS SoCs. The problem has been analyzed based on clock delays, cycle time, and complexity of the asynchronous port controllers. The analysis employs a timed STG approach in order to identify potential conflicts spanning asynchronous and synchronous circuits.

Several solutions have been discussed. First, we have shown that timing analysis can be used to verify known solutions (using arbitrated clocks) in the presence of clock delays. Second, a solution employing matched delays is described, where a control signal is delayed so as to match the clock delay and avoid synchronization failures.

A novel architecture for synchronizing inter-modular communications in GALS, based on *locally delayed latching* (LDL), has been presented. LDL synchronization does not require pausable clocking, is insensitive to clock tree delays, and supports high data rates. It replaces the complex global timing constraints on clock delays by simpler, more localized ones. Three different LDL ports have been described, two for input and one for output. Their operation has been demonstrated and analyzed by simulations. We also present a technology-independent analysis of the metastability risk in the synchronizer, and its effect on the synchronizer architecture.

### Acknowledgment

Technion-FORTH research collaboration has been funded in part by ACiD-WG.

#### References

- [1] E.G. Friedman, "Clock Distribution Networks in Synchronous Digital Integrated Circuits," Proc. of the IEEE, 89(5), 665-692, 2001.
- [2] G. Semeraro, D.H. Albonesi, S.G. Dropsho, G. Magklis, S. Dwarkadas, M.L. Scott, "Dynamic frequency and voltage control for a multiple clock domain microarchitecture," Proc. of IEEE/ACM International Symposium on

- Microarchitecture, pp. 356-367, 2002.
- [3] L. S. Nielsen, C. Niessen, J. Sparsø, and C. H. van Berkel. Low-power operation using self-timed and adaptive scaling of the supply voltage. IEEE Transactions on VLSI Systems, 2(4):391-397, December 1994.
- [4] W.R. Daasch, C.H. Lim, G. Cai, "Design of VLSI CMOS Circuits Under Thermal Constraint," IEEE Transactions on VLSI Systems, 49(8), 589-593, Aug. 2002.
- [5] D.M. Chapiro, "Globally-Asynchronous Locally-Synchronous Systems," PhD Dissertation, Stanford University, 1984.
- [6] D. Bormann, P. Cheung, "Asynchronous Wrapper for Heterogeneous Systems," Proc. of ICCD, pp.:307 314, 1997.
- [7] J. Kessels, A. Peeters, P. Wielage, and S.J. Kim, "Clock Synchronization through Handshake Signaling," Proc. ASYNC'02, 59-68, Manchester, UK, 2002.
- [8] S. Moore, G. Taylor, R. Mullins, P. Robinson, "Point to Point GALS Interconnect", Proc. ASYNC'02, Manchester, UK, April 2002.
- [9] S. Oetiker, F.K. Gürkaynak, T. Villiger, H. Kaeslin, N. Felber, W. Fichtner, "Design Flow for a 3-Million Transistor GALS Test Chip," ACiD workshop, January 2003.
- [10] T. Villiger, H. Kaeslin, F.K. Gürkaynak, S. Oetiker, Wolfgang Fichtner, "Self-Timed Ring for Globally-Asynchronous Locally-Synchronous Systems," Proc. ASYNC'03, 141-150, Vancouver, Canada, 2003.
- [11] J. Muttersbach, T. Villiger, W. Fichtner, "Practical Design of Globally-Asynchronous Locally-Synchronous Systems," ASYNC'00, Eilat, Israel, 52-61, 2000.
- [12] K. Y. Yun, R.P. Donohue, "Pausible clocking: a first step toward heterogeneous systems," Proc. of Computer Design: VLSI in Computers and Processors, ICCD'96,118 123, 1996.
- [13] K. Y. Yun, R.P. Donohue, "Pausible clocking-based heterogeneous systems," IEEE Transactions on VLSI Systems, 7(4), 482-488, Dec. 1999.
- [14] A. E. Sjogren and C. J. Myers, "Interfacing Synchronous and Asynchronous Modules within a High-Speed Pipeline," IEEE Transactions on VLSI Systems, 8(5), 573-583, Oct. 2000.
- [15] R. Dobkin, R. Ginosar and C. Sotiriou, "Data Synchronization Issues in GALS SoCs," Proc. of ASYNC'04, April 2004. Crete, Greece, 170-179.
- [16] C.L. Seitz, "System timing," in C.A. Mead and L.A. Conway, eds., Introduction to VLSI Systems, ch. 7, Addison-Wesley, 1980.
- [17] T. Chelcea, S. M. Nowick, "Robust interfaces for mixed-timing systems," IEEE Transactions on VLSI, 12(8), 857-873, 2004.
- [18] A. Chakraborty, M. R. Greenstreet, "Efficient self-timed interfaces for crossing clock domains," Proc. ASYNC 2003 78-88, 2003
- [19] A. Chakraborty, M. R. Greenstreet, "A minimal source-synchronous interface," Proc. ASIC/SOC Conference, 443 447, Sept. 2002.
- [20] S. Chakraborty, Joycee Mekie, and D.K. Sharma, "Reasoning about Synchronization Techniques in GALS Systems: A Unified Approach," Proc. Workshop on Formal Methods in GALS Architectures (FMGALS), 2003.
- [21] T.A. Chu, C.K.C. Leung, T.S. Wanuga, "A Design Methodology for Concurrent VLSI Systems", in Proc. of ICCD, 407-410, 1985.
- [22] J. Mekie, S. Chakraborty, D.K. Sharma, "Evaluation of pausible clocking for interfacing high speed IP cores in GALS framework," Proc. of VLSI Design, 559-564, 2004.
- [23] S.W. Moore, G.S. Taylor, P.A. Cunningham, R.D. Mullins, and P.Robinson, "Self-Calibrating Clocks for Globally Asynchronous Locally Synchronous Systems," Proc. International Conf. Computer Design (ICCD), 2000.
- [24] C. Dike and E. Burton, "Miller and Noise Effects in a Synchronizing Flip-flop," IEEE Journal of Solid-State Circuits, 34(6), 849-855, 1999.
- [25] R. Ginosar, "MTBF of multi-synchronizer SoC," http://www.ee.technion.ac.il/~ran/papers/MTBFmultiSyncSoc.pdf.
- [26] International Technology Roadmap for Semiconductors (ITRS), 2003.
- [27] Y. Semiat and R. Ginosar, "Timing Measurements of Synchronization Circuits," ASYNC 2003, Vancouver, Canada, May 2003.
- [28] R. Ho, K.W. Mai, M.A. Horowitz, "The Future of Wires," Proceedings of IEEE, 89(4), 2001.
- [29] N. Weste, D. Harris, "CMOS VLSI Design A Circuits and Systems Perspective", Addison-Wesley.
- [30] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and A. Yakovlev, "Petrify: a tool for manipulating concurrent specifications and synthesis of asynchronous controllers," IEICE Transactions on Information and Systems, Vol. E80- D, No. 3, 315–325, 1997.
- [31] J. B. Sulistyo, J. Perry, and D. S. Ha, "Developing Standard Cells for TSMC 0.25um Technology under MOSIS DEEP Rules", Dep. Elect. and Comp. Eng., Virginia Tech, Technical Report VISC-2003-01, November 2003.
- [32] J. B. Sulistyo and D.S. Ha, "A New Characterization Method for Delay and Power Dissipation of Standard Library Cells", VLSI Design 15 (3), 667-678, 2002.