# A Fully-Integrated 40-Gb/s Transceiver in 65-nm CMOS Technology

Ming-Shuan Chen, Yu-Nan Shih, Chen-Lun Lin, Hao-Wei Hung, and Jri Lee, Member, IEEE

*Abstract*—This paper introduces a fully-integrated wireline transceiver operating at 40 Gb/s. The transmitter incorporates a 5-tap finite-impulse response (FIR) filter with LC-based delay lines precisely adjusted by a closed-loop delay controller. The receiver employs a similar 3-tap FIR filter as an equalizer front-end with digital adaptation, and a sub-rate clock and data recovery circuit using majority-voting phase detection. The transceiver delivers 40-Gb/s $2^7 - 1$ PRBS data across a 20-cm Rogers channel (19-dB loss at 20 GHz) with BER $< 10^{-12}$ while consuming a total power of 655 mW.

*Index Terms*—Clock and data recovery (CDR), equalizer, finite-impulse response (FIR) filter, majority voting, transceiver (TRx).

## I. INTRODUCTION

HIGH-SPEED wireline transceivers (TRx) continue to play important roles in today's communications. Over the past decades, the data rate of optical and electrical links has been pushed from kb/s toward tens of Gb/s. Such an ultra-high bandwidth enables many applications, including wireless personal area network (WPAN) and network storage. For example, 100-Gb/s Ethernet has been fully investigated [1]. Among the proposed solutions, one popular architecture is to partition the 100-Gb/s optical signal into four sub-channels by wavelength division multiplexing (WDM). This four-lane architecture offers advantages in power efficiency and hardware reliability [2], [3]. Another example requiring a high-speed TRx is the so-called Light Peak technology [4], which provides ultra-fast data transfer between electronic devices. Meanwhile, multi-core processors with increasing computation capabilities also need aggregate I/O bandwidth as high as tens of Gb/s [5]. These applications motivate research on high-speed wireline TRx.

However, there are many difficulties in realizing a very broadband TRx. First of all, the channel loss at high frequency is significant. Fig. 1(a) shows the insertion loss of a 20-cm Rogers channel [6] (designed as a 50-$\Omega$ transmission line). Even for such a low-loss material, a 20-cm channel still presents 19-dB loss at 20 GHz. Applying 40-Gb/s random data to the channel, we obtain the corresponding far-end time-domain waveform shown in Fig. 1(b). As expected, the eye diagram is fully closed. In fact, a 40-Gb/s data eye would vanish if the channel were longer than 5 cm (12-dB loss at the Nyquist frequency). In addition to the channel loss, other issues such as reflection and crosstalk further worsen the situation. A single-bit response for a 20-cm Rogers channel is also shown in Fig. 1(c). The far-end pulse is heavily attenuated and distorted. Thus, the goal of this paper is to develop a fully-integrated transceiver prototype operating at 40 Gb/s with a pre-emphasis driver in the transmitter (Tx), and an equalizer as well as a clock and data recovery (CDR) circuit in the receiver (Rx).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSSC.2011.2176635

The 40-Gb/s TRx design at the circuit level also involves many challenges. In 65-nm CMOS technology, a conventional flipflop (FF) fails to operate beyond 24 Gb/s even with current-mode logic (CML) topology. As the data rate increases, a serious issue may occur if we take a flipflop as a sampler. An illustration is shown in Fig. 2, where a conventional CML flipflop is used. Here, the flipflop is operated in full-rate mode, i.e., each data bit is sampled once. If we gradually shift the clock edge of $CK_{\rm in}$ to the right, the output $D_{\rm out}$ will not flip immediately at $\theta = 0$, but will rather stay in its original state for a finite phase difference $\theta_1$. This is because the cross-coupled pair $M_3-M_4$ needs a large enough initial voltage to overcome the mismatch, finite bandwidth, and limited regeneration time. Similarly, as we shift $CK_{\rm in}$ to the left, the flipflop requires an excess phase $(-\theta_1)$ to change state. As a result, a "hysteresis" characteristic appears. In 65-nm CMOS, $\theta_1$ can be as large as 6° if $D_{\rm in}$ has a rate of 24 Gb/s. Such a phase uncertainty prohibits the use of a simple flipflop as a phase detector. Even without this issue, the clock-to-Q delay would still destroy the timing (phase) relationship between clock and data. Meanwhile, a high-speed combiner in a feedforward equalizer (FFE) suffers from large parasitic capacitance, which significantly degrades the overall bandwidth. The high-speed CDR and adaptive equalizer design inevitably encounters timing issues and other difficulties.

In our design, we incorporate transmission-line-based delay cells with background calibration to realize a full-rate FFE in both the Tx and the Rx, which eliminates the stringent timing requirement. A novel CDR structure combined with an adaptive equalizer is implemented in the Rx as well. Fabricated in 65-nm technology, this prototype achieves a 40-Gb/s data rate with BER < $10^{-12}$ over a 20-cm (19-dB loss) channel while consuming only 655 mW.

This paper is organized as follows. Sections II and III, respectively, present the transmitter and the receiver design and their considerations. Related building blocks are also introduced. Section IV reveals the experimental results, and Section V summarizes this work with a conclusion.

Manuscript received April 15, 2011; revised October 13, 2011; accepted October 27, 2011. Date of publication December 28, 2011; date of current version February 23, 2012. This paper was approved by Associate Editor Anthony Chan Carusone.

The authors are with the Electrical Engineering Department, National Taiwan University, Taipei, Taiwan (e-mail: jrilee@cc.ee.ntu.edu.tw).



Fig. 1. (a) Measured insertion loss of a 20-cm Rogers channel. (b) Eye diagram when 40-Gb/s data is applied to it (horizontal scale: 5 ps/div, vertical scale: 100 mV/div). (c) Single-bit response of 40-Gb/s data passing through a 20-cm channel (horizontal scale: 20 ps/div, vertical scale: 100 mV/div).


Fig. 2. Hysteresis sampling of D-flipflop (simulated in 65-nm CMOS).

## II. TRANSMITTER

### A. Architecture

To fully study the Tx's behavior at high speed, we investigate the Tx architecture with three different approaches. Note that all these approaches are full-rate. Even though half-rate architectures have been demonstrated to achieve low power consumption at 20 Gb/s [7], [8], they can hardly be used at 40 Gb/s because CMOS logic would dissipate significant power at this speed. Duty-cycle distortion would be another issue. As a result, we only discuss full-rate structures here.

Approach I: A basic Tx consists of a feedforward equalizer, a combiner (also known as an output driver), and usually a clock multiplication unit (CMU), which is basically an integer-N phase-locked loop (PLL) providing clocks. Fig. 3(a) depicts a conventional 40-Gb/s Tx design with a 4-tap FFE. This structure has proven very powerful and is therefore widely used. At 40 Gb/s, however, quite a few difficulties arise. First, for a flipflop to operate at very high speed, output buffers must be added in order to drive the combiner and the next flipflop. Even with a CML structure, the parasitics still cause serious problems. They create a clock-to-Q delay as large as 15 ps in 65-nm CMOS, which is a significant fraction of one bit period (25 ps). As a result, the next flipflop suffers from misalignment [Fig. 3(b)], i.e., the data output is shifted to the right and thus the clock edge no longer falls in the center of the data eye. The flipflop would have insufficient time for data regeneration, resulting in inter-symbol interference (ISI). In addition, we need a large clock buffer tree to drive the loading, which not only consumes significant power but also increases layout difficulty. In fact, we need at least nine buffers to deliver the full-rate clock to the flipflops evenly, and each of them draws 9 mA. The overall current consumption of the whole transmitter is up to 463 mA.

To verify this observation, we designed and fabricated such a Tx in 65-nm CMOS and measured its performance. As revealed in Fig. 4, the output data eye contains serious ISI even at 30 Gb/s. With 0-dB boosting, error-free operation (i.e., BER  $< 10^{-12}$ ) can only be maintained below 32 Gb/s. It is hard to imagine its pre-emphasis function at 40 Gb/s, as it will lead to more serious ISI. Note that the supply must be raised up to 2 V to make it work properly. If a 1.2-V supply were used, the internal clock would not have adequate swing to switch a flipflop completely.
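The Approach I figures quoted above can be put side by side with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the supply-power figure assumes that the entire 463 mA is drawn from the raised 2-V supply, which the text does not state explicitly.

```python
# Back-of-the-envelope check of the Approach I timing and clock-tree figures.
BIT_RATE = 40e9            # b/s
UI_PS = 1e12 / BIT_RATE    # one bit period (unit interval) in ps
T_CKQ_PS = 15.0            # clock-to-Q delay quoted for 65-nm CMOS
N_CLOCK_BUFFERS = 9        # minimum number of full-rate clock buffers
I_BUFFER_MA = 9.0          # current drawn by each buffer
I_TOTAL_MA = 463.0         # total Tx current quoted in the text
VDD_V = 2.0                # ASSUMPTION: all current is drawn from the 2-V supply

# Fraction of the bit period consumed by the clock-to-Q delay.
ckq_fraction = T_CKQ_PS / UI_PS

# Current spent in the clock buffer tree alone.
clock_tree_ma = N_CLOCK_BUFFERS * I_BUFFER_MA

# Implied supply power under the single-supply assumption above.
supply_power_mw = I_TOTAL_MA * VDD_V

print(f"UI = {UI_PS:.1f} ps, clock-to-Q uses {ckq_fraction:.0%} of it")
print(f"clock tree: {clock_tree_ma:.0f} mA of {I_TOTAL_MA:.0f} mA")
print(f"implied supply power: {supply_power_mw:.0f} mW")
```

The clock-to-Q delay alone takes more than half of the unit interval, which is why the following flipflop no longer samples near the eye center.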

Fig. 3. (a) Approach I's Tx design. (b) Phase misalignment issue due to clock-to-Q delay.

Fig. 4. (a) Tx output data eye of Approach I's Tx at 30 Gb/s (horizontal scale: 5 ps/div, vertical scale: 80 mV/div), and (b) its BER plot as a function of data rate.

Fig. 5. (a) Approach II's Tx design, and (b) its output data eye at 20 Gb/s (horizontal scale: 10 ps/div, vertical scale: 70 mV/div).

Approach II: To relax the timing issue in the combiner, we come up with the modified topology shown in Fig. 5(a). Here, $\Delta t$ delays are inserted into both the clock and the output data paths. They are made of transmission lines or, equivalently, LC networks, to compensate the clock-to-Q delays. More specifically, if we have

$$\Delta t = T_{\rm CK-Q} - \frac{1}{2} \rm UI, \qquad (1)$$

the clock edge will be realigned to the data eye center. The skew introduced by $L_1$ to $L_4$ can be neutralized by $L_5$ to $L_8$ in the data combination path so that the taps are equally separated by 1-UI delay. To make a fair comparison, we have also designed and fabricated such a Tx in 65-nm CMOS. The $\Delta t$ delays also absorb parasitic capacitance to some extent. However, this phase compensation method provides negligible improvement to the overall performance, since the overall bandwidth is still limited by that of the flipflops. The measured data eye is presented in Fig. 5(b), which contains serious ISI even at a speed as low as 20 Gb/s. A possible reason for this degradation is that the high-speed clock driving the flipflops is seriously attenuated after travelling through a long wire. The total wire length (including inductors) from $CK_{in1}$ to $CK_{in5}$ is 1.05 mm. As demonstrated in [9], even with inductive peaking, a CML flipflop can barely operate at 40 Gb/s. Besides, the transmission-line delay is subject to process, supply voltage, and temperature (PVT) variations. Simulation shows that the $\Delta t$ delay would deviate by 15% if both the inductor and the capacitor go off by 20% from the desired values. The current consumption here is large as well (435 mA), since the clock buffer still drives a huge loading. Some other works such as [7] use a half-rate finite impulse response (FIR) architecture, but the highest data rate is limited to 20 Gb/s.

Fig. 6. (a) Proposed transmitter structure. (b) Input data after delays.

Fig. 7. (a) Realization of delay element $(T_b + \Delta t)$, and (b) its propagation delay for broadband and narrowband signals.
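As a worked instance of Eq. (1), the short sketch below evaluates the compensation delay for one set of numbers. It assumes, purely for illustration, that the Approach II flipflops show the same 15-ps clock-to-Q delay quoted for Approach I; the function name is hypothetical.

```python
def compensation_delay_ps(t_ckq_ps: float, ui_ps: float) -> float:
    """Eq. (1): delta_t = T_CK-Q - UI/2, the delay that re-centers the clock edge."""
    return t_ckq_ps - 0.5 * ui_ps


UI_40G_PS = 25.0        # one bit period at 40 Gb/s
T_CKQ_ASSUMED_PS = 15.0  # ASSUMPTION: same clock-to-Q delay as quoted for Approach I

delta_t_ps = compensation_delay_ps(T_CKQ_ASSUMED_PS, UI_40G_PS)
print(f"delta_t = {delta_t_ps:.1f} ps")
```

Under this assumption the required compensation is only a few picoseconds, which is small compared with the 15% delay deviation reported for a 20% shift in both L and C.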

Proposed Tx: To overcome the above difficulties, we propose a new Tx structure as shown in Fig. 6(a). It includes a 20-GHz PLL, a 5-tap FFE with tunable delay, and a 1-UI delay generator. The 5-tap FFE incorporates delay element  $T_{\rm b}$ , which is nominally equal to 25 ps, and compensation element  $\Delta t$ . Both of them are made of LC networks. The 40-Gb/s input data through the FIR filter are combined by exactly 25-ps separation with different weighting coefficients. Simulated data (single bit) after delay lines (i.e.,  $D_{in,A}$ ,  $D_{in,B}$ ,  $D_{in,C}$ , and  $D_{in,D}$ ) are shown in Fig. 6(b). With the PLL and frequency divider providing accurate 10-GHz clock, the 1-UI delay generator creates exactly 25-ps delay. The 1-UI delay generator is actually a delay-locked loop (DLL), producing proper control voltage  $V_{ctrl}$  to make the 10-GHz clock have 90° phase shift. Since the  $T_b$ -delay elements in the FFE and the DLL are governed by the same control voltage, the 40-Gb/s data experiences 25-ps delay between taps as well. Note that this loop dynamically tracks the delay and precisely optimizes the performance. The power consumption can be substantially reduced since no retiming flipflop is included.

The reader may wonder why the proposed Tx does not involve full-rate data retiming. In this prototype, we emphasize the realization of an agile FIR at 40 Gb/s. The incoming 40-Gb/s data can be provided by a multiplexer (MUX) in front of this work, which could be driven by a 20-GHz clock [8]. In other words, we can obviate the need for 40-Gb/s data retiming in the whole data path, given that the half-rate clock is reasonably symmetric (i.e., duty cycle = 50%). Theoretically, half-rate structures can be applied to our delay-line-based FFE as well, but the extra power dissipated in the second data path makes it less attractive.

Fig. 8. 5-tap FIR combiner and S<sub>11</sub> at output port (looking into the Tx).

Fig. 9. (a) Convergence of the 1-UI delay generator. (b) Phase detector and its characteristic. (c) Data output for 0%, 10%, and 20% delay errors.

### B. Building Blocks

The $T_{\rm b}$ delay line is implemented as shown in Fig. 7(a). Here, a differential pair $M_1-M_2$ propagates input data into an emulated transmission line made of distributed inductors and varactors. For each 25-ps delay, we have seven LC segments. By tuning $V_{\rm ctrl}$ from ground to $V_{\rm DD}$, the input-to-output delay can be varied from 23 to 30 ps, which corresponds to a tuning range of 28%. Note that an extra delay $\Delta t$ has been introduced to compensate the inter-tap delay in the combiner. Fig. 7(b) illustrates the simulated propagation delay (only $T_{\rm b}$ is considered) for the 10-GHz clock and 40-Gb/s data passing through this emulated transmission line. One important observation is that, even though the 25-ps delay is created from the 10-GHz clock, the artificial transmission line still provides good consistency for broadband data.<sup>1</sup> Fortunately, fine LC segmentation here (L = 345 pH, C = 45 fF) guarantees a good transmission line approximation. The maximum group delay deviation between 40-Gb/s data and 10-GHz clock is less than 1 ps across the whole tuning range. As compared with conventional flipflop-based delay elements, the power consumption for each UI delay has been reduced from 26 mW to 12 mW.
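The segment values quoted above can be sanity-checked with the textbook formulas for an ideal lossless LC ladder. The sketch below is an idealized estimate only: it treats C = 45 fF as a fixed nominal value (in the circuit it is a tunable varactor), ignores the loading of the driver and taps, and uses hypothetical function names.

```python
import math

L_SEG_H = 345e-12   # inductance per segment
C_SEG_F = 45e-15    # ASSUMPTION: nominal (fixed) capacitance per segment
N_SEGMENTS = 7      # LC segments per delay element


def segment_delay_s(l_h: float, c_f: float) -> float:
    """Low-frequency delay of one ideal LC section: sqrt(L*C)."""
    return math.sqrt(l_h * c_f)


def characteristic_impedance_ohm(l_h: float, c_f: float) -> float:
    """Characteristic impedance of the ideal ladder: sqrt(L/C)."""
    return math.sqrt(l_h / c_f)


def cutoff_frequency_hz(l_h: float, c_f: float) -> float:
    """Cutoff frequency of an ideal LC ladder: 1 / (pi * sqrt(L*C))."""
    return 1.0 / (math.pi * math.sqrt(l_h * c_f))


line_delay_ps = N_SEGMENTS * segment_delay_s(L_SEG_H, C_SEG_F) * 1e12
z0_ohm = characteristic_impedance_ohm(L_SEG_H, C_SEG_F)
f_cut_ghz = cutoff_frequency_hz(L_SEG_H, C_SEG_F) / 1e9

# Tuning range quoted in the text: 23 ps to 30 ps around a 25-ps nominal delay.
tuning_range = (30.0 - 23.0) / 25.0

print(f"7-segment delay ~ {line_delay_ps:.1f} ps")
print(f"Z0 ~ {z0_ohm:.0f} ohm, ladder cutoff ~ {f_cut_ghz:.0f} GHz")
print(f"tuning range = {tuning_range:.0%}")
```

In this ideal model the seven-segment delay falls inside the quoted 23-to-30-ps range, and the ladder cutoff sits well above the 20-GHz Nyquist frequency, which is consistent with the small clock-versus-data delay deviation reported above.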

In addition to accuracy, another important advantage of this approach is that most parasitic capacitances are absorbed into the transmission lines, which increases the bandwidth considerably. By the same token, the FIR combiner must have a similar resonating element between taps, otherwise the routing parasitics would degrade the output data. A resonating element here is actually one segment of the LC network in the $T_{\rm b}$ delay line [10], [11]. The 5-tap FIR combiner has the structure shown in Fig. 8. Here, each LC combination represents a $\Delta t$ delay. As a result, the net delay between two adjacent taps is given by $T_{\rm b} + \Delta t - \Delta t = T_{\rm b} = 25 \text{ ps}$. Ideally, the number of taps in the Tx can be increased to further eliminate the ISI. However, if the number of taps is too large, the LC delay line of the combiner would present a low-pass response, which may cause distortion at high frequencies. More power would be dissipated as well due to the extra taps. For these reasons, we select a 5-tap FFE in this prototype. Note that one LC delay fits very well in our layout arrangement. To increase the data swing, the output port is made as an open drain and is connected to the Rx directly. A single 50-$\Omega$ terminator on the Rx side has insignificant influence on the reflection. Simulation shows that even with 20% deviation, the reflected signal of a 200-mV pulse after travelling a round trip through a 20-cm channel would be less than 3 mV. A 50-$\Omega$ impedance matching has been implemented in the Tx's output and the Rx's input. Such a dc connection between Tx and Rx increases the data swing by a factor of two if good matching can be maintained. The simulated $S_{11}$ of the Tx's output is plotted in Fig. 8, staying below -5.8 dB from dc to 20 GHz.

<sup>1</sup>It is well known that if a broadband signal [e.g., non-return-to-zero (NRZ) data] goes through a non-ideal transmission line, different frequency components suffer from different phase shifts, leading to dispersion. As a result, broadband data and a narrowband clock may experience different delay times.

Fig. 10. Proposed Rx architecture.

The 1-UI delay generator, which is actually a DLL, is unconditionally stable. Note that the generated delay must be precise, otherwise the delay deviation would increase ISI and jitter significantly. With phase detector gain =  $200 \ \mu$ A/rad and loop capacitor  $C = 10 \ \text{pF}$ , the loop approaches a steady state within 0.7  $\mu$ s [Fig. 9(a)]. The phase detector design is shown in Fig. 9(b). To lock two 10-GHz clocks in quadrature, we use a single-sideband (SSB) mixer to distill the phase error [12]. That is, if the phase difference between the two inputs is represented as  $\Delta \theta$ , we have the output  $V_{\rm PD}$  as

$$V_{\rm PD} = k_1 A_1 A_2 \cos(\Delta \theta) \tag{2}$$

where $k_1$ denotes the mixer gain, and $A_1$ and $A_2$ the input amplitudes, respectively. The sinusoidal characteristic forces the phase error to be locked at 90° with an approximately linear behavior in the vicinity. Note that non-idealities such as the second-order harmonic caused by mismatch can be easily suppressed by the RC network at the output node. The only issue this linear phase detector would encounter is the mismatch between the SSB mixer's two paths. Monte Carlo simulation suggests that in 65-nm CMOS, one standard deviation of the signal path mismatch could cause an equivalent 0.79-ps phase error between the two inputs, which is 3.2% of one data bit at 40 Gb/s. For comparison, we depict the simulated output waveform with 10% and 20% delay deviations in Fig. 9(c). The ISI increases by 3% and 7% as compared with the ideal case, and the jitter by 11% and 23%, respectively. Other than the delay variation caused by inductors and capacitors, active devices and resistors may lead to significant delay variation as well owing to inadequate modeling. Placing a precise delay generator on chip eliminates these issues. Without the DLL's automatic calibration, the delay may easily go off by more than 20%. This demonstrates the need for a DLL here.
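The lock point and the quoted mismatch figure can be illustrated numerically from Eq. (2). The sketch below is a normalized model, not the circuit: $k_1$, $A_1$, and $A_2$ are all set to 1 as an assumption, and the helper names are hypothetical.

```python
import math


def pd_output(delta_theta_rad: float, k1: float = 1.0,
              a1: float = 1.0, a2: float = 1.0) -> float:
    """Eq. (2): V_PD = k1 * A1 * A2 * cos(delta_theta). Gains normalized (assumption)."""
    return k1 * a1 * a2 * math.cos(delta_theta_rad)


def time_to_phase_deg(t_s: float, f_hz: float) -> float:
    """Convert a time offset to degrees of phase at clock frequency f_hz."""
    return 360.0 * t_s * f_hz


F_CLK_HZ = 10e9       # DLL clock frequency
UI_S = 25e-12         # one bit period at 40 Gb/s
MISMATCH_S = 0.79e-12  # one-sigma equivalent phase error from the text

# The detector output crosses zero when the two clocks are in quadrature.
lock_point_rad = math.pi / 2
v_at_lock = pd_output(lock_point_rad)

# Numerical slope around the lock point: close to -k1*A1*A2, i.e. locally linear.
step = 1e-6
slope_at_lock = (pd_output(lock_point_rad + step)
                 - pd_output(lock_point_rad - step)) / (2 * step)

ui_phase_deg = time_to_phase_deg(UI_S, F_CLK_HZ)            # 25 ps at 10 GHz
mismatch_phase_deg = time_to_phase_deg(MISMATCH_S, F_CLK_HZ)
mismatch_fraction_of_ui = MISMATCH_S / UI_S

print(f"V_PD at lock = {v_at_lock:.2e}, slope = {slope_at_lock:.3f} per rad")
print(f"25 ps at 10 GHz = {ui_phase_deg:.0f} deg")
print(f"0.79 ps = {mismatch_phase_deg:.2f} deg = {mismatch_fraction_of_ui:.1%} of one UI")
```

The check confirms that a 90° shift of the 10-GHz clock corresponds to exactly one 40-Gb/s bit period, which is why locking the two clocks in quadrature sets the tap spacing.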

## III. RECEIVER

The 40-Gb/s receiver design also involves considerable challenges. Recent implementations of 40-Gb/s CDRs adopt multi-phase sampling with sub-rate clocks [13]–[15], making clock generation and distribution very important. Meanwhile, at 40 Gb/s, both analog and decision feedback equalizers (DFEs) become very challenging due to the insufficient bandwidth and finite flipflop setup time, respectively. We present our receiver solution in this section.

Fig. 11. (a) Illustration of reference level for two extreme cases. (b) Realization of adaptation unit.

### A. Architecture

Fig. 10 illustrates the receiver design. Unlike typical approaches, we use a 3-tap adaptive FIR filter as the front-end equalizer [16], [17]. The receiver also contains a phase-interpolation-based CDR and a 1-to-64 demultiplexer. It is well known that a conventional DFE suffers from a stringent speed requirement in its feedback path, and present solutions such as sub-rate or loop-unrolled structures involve complicated design and significant power consumption [18], [19]. Using analog equalizers also encounters bandwidth and accuracy issues. Simulation suggests that a 40-Gb/s source-degenerated differential pair with inductive peaking provides at most 3.6-dB boosting at 20 GHz. Meanwhile, the one-dimensional adaptability makes precise compensation very difficult.

To overcome these issues, we place a 3-tap FIR in the Rx front-end. It basically follows the 5-tap feedforward equalizer in the Tx. The only difference is that the FIR in the Rx has adaptability. Working together with that in the Tx, the 3-tap adaptive FIR filter in the Rx compensates the remaining ISI caused by PVT variations. Note that the number of taps in the receiver cannot be arbitrarily large, otherwise the nonlinearity of the delay line would cause undesired distortion. Thus, only 3 taps are used in the Rx. Providing up to 14-dB boosting at the Nyquist frequency (i.e., 20 GHz), the delay in the adaptive FIR equalizer also gets dynamically calibrated by a 1-UI delay generator. In this prototype, we have no limiting amplifier in the Rx. A design with a limiting amplifier such as that in [3] can possibly be used in the future. In this quarter-rate architecture, the 40-Gb/s data is sampled and demultiplexed by the semi-quadrature clocks of 10 GHz and other lower frequency clocks produced by the CDR. With the edge also sampled by this clock, the CDR adjusts the clock phase accordingly. It is very similar to a regular quarter-rate Alexander phase detector [20], but with only one edge processed every 4-bit period for simplicity. If necessary, the hardware here can be easily modified to cover every single bit in a re-design. The adaptation unit uses the sign-sign LMS algorithm [21], [22] to optimize the equalization. Signals in both the CDR and equalization units are processed in the digital domain, and 64 parallelized output data streams are available at 625 Mb/s. Note that we have omitted $\phi_{270}$ and the corresponding demultiplexers here simply because neither the CDR nor the adaptation unit needs to deal with the data demultiplexed in this phase.

### B. Adaptation

The FIR filter adaptation is realized as follows. First, the reference swing generator, the slicer (driven by $\phi_{90}$), and the adaptation logic detect the average data swing (i.e., reference data level). For a data bit to be optimally compensated, the eye center (i.e., the most-open point) must be located around this level. Thus, we adjust the 3 coefficients $\alpha_{-1}$, $\alpha_0$, $\alpha_1$ such that the average rms error is minimized. The sign-sign LMS algorithm is applied accordingly. Meanwhile, since the overall tail current of the FIR combiner is a constant, the data common-mode level must be acquired. The adaptation unit is implemented as shown in Fig. 11. The 16 demultiplexed data and compensation error signal $E_n$ are sent to the LMS logic, which takes a majority vote to determine the polarity. A 15-bit up-down counter serves as a digital loop filter, in which a 3-bit bandwidth control is included to optimize the convergence speed if necessary. Subsequently, three 7-bit iDACs provide coefficients $\alpha_{-1}$, $\alpha_0$, $\alpha_1$ to the FIR equalizer to adjust the boosting. For $\alpha_{-1} = \alpha_1 = -0.2$ and $\alpha_0 = 0.6$, for instance, the 3-tap FIR presents a boosting of 7.4 dB. All the blocks in the adaptation unit are realized as digital circuits with CMOS (rail-to-rail) logic synthesized by Synopsys Design Vision and Astro [23], [24]. Note that the total current amount in the FIR combiner is a constant. The voting result for $E_n \oplus D_n$ is used to create the present swing information for the reference-level generator.

Fig. 12. (a) Reference-level generator. (b) Opamp design and its response.

Fig. 13. (a) CDR logic. (b) 7-bit iDAC.
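The adaptation path described above (majority vote, up-down counter, 7-bit code to the iDAC) can be sketched behaviorally. This is a minimal software model under stated assumptions, not the synthesized logic: the class and function names are hypothetical, the counter is assumed to start at mid-scale and saturate, the 3-bit bandwidth control is assumed to scale the step by a power of two, and the update polarity is a convention chosen for illustration.

```python
def sign(x: float) -> int:
    """Return +1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def majority_vote(votes: list[int]) -> int:
    """Collapse a batch of +/-1 votes into +1, -1, or 0 (tie)."""
    return sign(sum(votes))


def sign_sign_lms_direction(errors: list[float], data: list[float]) -> int:
    """Vote on sign(e_n) * sign(d_n) over one batch of demultiplexed samples."""
    return majority_vote([sign(e) * sign(d) for e, d in zip(errors, data)])


class UpDownCounter:
    """Saturating up/down counter used as a digital loop filter (behavioral model)."""

    def __init__(self, bits: int = 15, dac_bits: int = 7) -> None:
        self.max_value = (1 << bits) - 1
        self.shift = bits - dac_bits
        self.value = 1 << (bits - 1)   # ASSUMPTION: start at mid-scale

    def step(self, direction: int, bw_code: int = 0) -> None:
        # ASSUMPTION: the bandwidth code scales the step size by 2**bw_code.
        new_value = self.value + direction * (1 << bw_code)
        self.value = max(0, min(self.max_value, new_value))

    @property
    def dac_code(self) -> int:
        """Upper bits of the counter, as they would drive a 7-bit iDAC."""
        return self.value >> self.shift


# Usage: 16 samples per batch, with 10 of 16 products voting "up".
counter = UpDownCounter()
batch_errors = [+1.0] * 10 + [-1.0] * 6
batch_data = [+1.0] * 16
initial_code = counter.dac_code
for _ in range(256):
    counter.step(sign_sign_lms_direction(batch_errors, batch_data))
final_code = counter.dac_code
print(initial_code, final_code)
```

With a 15-bit counter feeding a 7-bit code, 256 consistent votes are needed to move the coefficient by one LSB at the slowest setting, which illustrates how the counter width sets the adaptation bandwidth.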

The reference-level generator is illustrated in Fig. 12(a). Here,  $M_1-M_2$  pair is fully tilted, and the swing-reference iDAC provides a tail current directly connected to the common-source node of  $M_1-M_2$ . The Opamp along with  $M_3-M_6$  loop ensures that the common-mode level of reference swing  $V_{\rm SW}$  is equal to that of the data output, which is important for the circuit to make a fair amplitude comparison. The Opamp design is illustrated in Fig. 12(b), which achieves 35-dB gain and 300-kHz bandwidth with 81° phase margin while consuming only 0.56 mW of power.



Fig. 14. (a) Multiphase clock generator. (b) Phase interpolator and linearity.

### C. CDR

The CDR design employs binary operation at sub-rate (i.e., 625 MHz). Fig. 13(a) depicts the CDR circuit. Similar to the adaptation unit, the demultiplexed data and edges are applied to the CDR logic to perform bang-bang phase detection. Again, phase adjustment is accomplished by majority voting. After the digital loop filter, the 7 MSBs are fed into a phase decoder, generating 7-bit I/Q signals. The 10-GHz clock is therefore rotated clockwise or counterclockwise based on the phase difference. The 7-bit iDAC is realized in current mode, where the 3 MSBs are in thermometer code and the 4 LSBs in binary [Fig. 13(b)] [25]. Meanwhile, 3-bit codes are applied to the digital loop filter to serve as a bandwidth control. Due to the low CDR bandwidth, the loop latency of the digital filter has only a negligible effect on the performance.
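The bang-bang decision and phase rotation described above can be modeled in a few lines. The sketch follows the standard Alexander early/late rule on a (previous data, edge, next data) triple; the function names, the batch contents, and the sign convention that maps a vote to a phase-code step are assumptions for illustration, not the chip's logic.

```python
def alexander_decision(d_prev: int, edge: int, d_next: int) -> int:
    """Return -1 if the clock is early, +1 if late, 0 if there is no data transition."""
    if d_prev == d_next:
        return 0                       # no transition, no timing information
    return -1 if edge == d_prev else +1


def majority(decisions: list[int]) -> int:
    """Collapse a batch of early/late decisions into a single +1, -1, or 0."""
    total = sum(decisions)
    return (total > 0) - (total < 0)


def update_phase_code(code: int, vote: int, bits: int = 7) -> int:
    """Rotate the phase code by one step, wrapping around the full circle.

    ASSUMPTION: a 'late' vote (+1) steps the code down and an 'early' vote (-1)
    steps it up; the real polarity depends on the interpolator wiring.
    """
    return (code - vote) % (1 << bits)


# Usage: a batch in which late decisions outnumber early ones.
samples = [(0, 1, 1), (1, 0, 0), (0, 0, 1), (1, 1, 1), (0, 1, 1)]
vote = majority([alexander_decision(*s) for s in samples])
new_code = update_phase_code(0, vote)
print(vote, new_code)
```

The modulo wrap is what lets the interpolated clock rotate indefinitely in either direction, so a frequency offset between the data and the reference does not exhaust the control range.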

The clock generator has to provide semi-quadrature clocks at 10 GHz, and multiphase clocks at 5, 2.5, 1.25, and 0.625 GHz. As presented in Fig. 14(a), a 20-GHz PLL multiplies the 312.5-MHz reference by a factor of 64, and after a divide-by-2 circuit the 10-GHz I and Q signals are fed into the phase interpolators (PIs). While the two PIs in the first stage rotate the I and Q clocks based on the CDR's output, the three PIs in the second stage span them into the required clock phases. Simulation shows that the maximum phase deviation of these clock phases would be less than 2% (0.5 ps) over PVT variation. The lower frequency clocks are then produced by the subsequent divider chain. We carefully adjust the layout so that all the clock paths are balanced and the overall power is optimized.
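The frequency plan in the paragraph above can be verified with simple arithmetic. The sketch below only restates numbers from the text; the variable names are illustrative, and interpreting the 2% phase deviation as a fraction of one 25-ps bit period is an assumption based on the quoted 0.5-ps figure.

```python
REF_HZ = 312.5e6      # PLL reference frequency
PLL_MULTIPLIER = 64   # multiplication factor of the 20-GHz PLL
BIT_RATE = 40e9       # b/s
DEMUX_RATIO = 64      # 1-to-64 demultiplexer
UI_PS = 25.0          # one bit period at 40 Gb/s

vco_hz = REF_HZ * PLL_MULTIPLIER          # PLL output
quarter_rate_hz = vco_hz / 2              # after the divide-by-2 stage

# Divider chain producing the lower-frequency multiphase clocks.
divided_clocks_hz = [quarter_rate_hz / 2 ** k for k in range(1, 5)]

# Rate of each parallel output stream after the 1-to-64 demultiplexer.
output_stream_rate = BIT_RATE / DEMUX_RATIO

# ASSUMPTION: the 2% deviation is referred to one bit period.
max_phase_deviation_ps = 0.02 * UI_PS

print(f"VCO: {vco_hz / 1e9:.0f} GHz, sampling clock: {quarter_rate_hz / 1e9:.0f} GHz")
print("divided clocks (GHz):", [f / 1e9 for f in divided_clocks_hz])
print(f"output streams: {output_stream_rate / 1e6:.0f} Mb/s")
print(f"max phase deviation: {max_phase_deviation_ps:.1f} ps")
```

The slowest divided clock matches the 625-Mb/s rate of the 64 parallel output streams, which is also the rate at which the CDR and adaptation logic operate.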

Fig. 14(b) illustrates the phase interpolator design, where the tail currents are differentially controlled by the bias signals from the CDR. A conventional phase interpolator tends to switch the clock polarity before interpolation in order to lower the output capacitance and reduce the power consumption. However, this also introduces a lot of jitter directly to the output clock. In this design, we adopt a fully-differential approach in the tail current sources to avoid switching clocks, arriving at better performance in terms of clock jitter. Simulation shows that the maximum integral nonlinearity (INL) and differential nonlinearity (DNL) are less than 3 and 0.3 LSBs, respectively. As compared with conventional designs [26], our approach improves the linearity significantly.<sup>2</sup>
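The basic principle of phase interpolation, weighting quadrature clocks to synthesize an intermediate phase, can be illustrated with an ideal sinusoidal model. The sketch below is a generic textbook model and not a simulation of the circuit in Fig. 14(b): the function names are hypothetical, the linear weighting law and the 32 steps per quadrant are assumptions, and the resulting error figure should not be compared with the INL quoted above.

```python
import math


def interpolated_phase_deg(w_i: float, w_q: float) -> float:
    """Phase of w_i*cos(wt) + w_q*sin(wt), assuming ideal sinusoidal I/Q clocks."""
    return math.degrees(math.atan2(w_q, w_i)) % 360.0


def linear_weight_worst_error_deg(steps_per_quadrant: int = 32) -> float:
    """Worst deviation from a uniform phase ramp when w_i + w_q = 1 in one quadrant.

    ASSUMPTION: weights change linearly with the code, 32 codes per quadrant.
    """
    worst = 0.0
    for n in range(steps_per_quadrant + 1):
        w = n / steps_per_quadrant
        actual = interpolated_phase_deg(1.0 - w, w)
        ideal = 90.0 * w
        worst = max(worst, abs(actual - ideal))
    return worst


# Usage: the four cardinal phases and the mid-quadrant point.
cardinal_phases = [interpolated_phase_deg(1, 0), interpolated_phase_deg(0, 1),
                   interpolated_phase_deg(-1, 0), interpolated_phase_deg(0, -1)]
mid_phase = interpolated_phase_deg(0.5, 0.5)
worst_error_deg = linear_weight_worst_error_deg()
print(cardinal_phases, mid_phase, round(worst_error_deg, 2))
```

Even in this ideal model a linear weighting law does not give perfectly uniform phase steps, which is one reason interpolator linearity is a design concern.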

The 20-GHz VCO follows a typical LC tank design with a bottom resistor boosting up the common-mode level [Fig. 15(a)]. A differential pair with inductively-peaked loading is placed as a clock buffer. The under-damped characteristic significantly saves power. The divider is implemented as a standard static divide-by-2 circuit in CML [Fig. 15(b)]. The VCO presents a tuning range of 1.6 GHz and the divider a lock range of 32.5 GHz for a 0-dBm input power.

<sup>2</sup>Other clock phase generator designs can be found in [27]. Our approach actually combines the advantages of a conventional one and [27].

Fig. 15. (a) 20-GHz VCO. (b) First divider.

## IV. EXPERIMENTAL RESULTS

The transceiver has been designed and fabricated in 65-nm CMOS technology. Fig. 16(a) shows the die photos. The Tx occupies $0.9 \times 0.7 \text{ mm}^2$ and the Rx $1.35 \times 0.85 \text{ mm}^2$. The Tx consumes 135 mW from a 1.2-V supply, of which 29 mW is dissipated in the 20-GHz PLL, 77 mW in the 5-tap FIR, and 29 mW in the 1-UI delay generator. The Rx consumes 520 mW, where the analog and digital circuits use 1.6-V and 1.2-V supplies, respectively. Due to insufficient bandwidth, we increase the analog supply to 1.6 V in this prototype. This non-standard supply on the Rx has no impact on the Tx since the data swing is determined by the tail current of the open-drain driver in the Tx. Both chips are mounted on a Rogers board. Responses for different channel lengths are investigated in our measurement to fully study the performance. The testing setup is shown in Fig. 16(b). High-speed probes provide the original data input to the Tx and capture the final data output from the Rx. Low-speed and dc lines are wire-bonded to the board directly.
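The power figures above can be cross-checked against the 655-mW total quoted in the abstract. The sketch below only adds up numbers from the text; the energy-per-bit value is a derived figure of merit that is not stated in the paper, and the dictionary keys are illustrative names.

```python
# Tx power breakdown quoted in the text, in mW.
tx_breakdown_mw = {"pll_20ghz": 29, "fir_5tap": 77, "delay_generator": 29}
RX_POWER_MW = 520
BIT_RATE = 40e9  # b/s

tx_power_mw = sum(tx_breakdown_mw.values())
total_power_mw = tx_power_mw + RX_POWER_MW

# Derived metric (not quoted in the paper): energy spent per transmitted bit.
energy_per_bit_pj = total_power_mw * 1e-3 / BIT_RATE * 1e12

print(f"Tx = {tx_power_mw} mW, total = {total_power_mw} mW")
print(f"energy per bit ~ {energy_per_bit_pj:.1f} pJ/b")
```

The three Tx blocks add up to the stated 135 mW, and the Tx and Rx together give the 655-mW total.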

The 20-GHz PLL in the Tx presents a phase noise of -91.6 dBc/Hz at 1-MHz offset [Fig. 17(a)]. The overall rms jitter integrated from 10 Hz to 6.5 GHz is equal to 0.58 ps. The time-domain waveform of the 20-GHz clock is also plotted in Fig. 17(b). The rms jitter captured in the time domain matches the integration result from the spectrum if the oscilloscope's noise is de-embedded [28]. Fig. 18 shows the Tx's output at 40 Gb/s with different boosting, i.e., no pre-emphasis (left) and 9.5-dB boosting (right). To make sure the 1-UI delay generator functions properly, we deliberately tuned the reference frequency to the edges and observed the data output. Since the PLL's tuning range covers from 19.4 to 21 GHz, we show the eye diagrams of the Tx's output (before the channel) for 38.8 and 42 Gb/s in Fig. 19. The optimal coefficients are set as $\alpha_{-2} = -0.10$, $\alpha_{-1} = 0.75$, $\alpha_0 = -0.09$, $\alpha_1 = 0.04$, and $\alpha_2 = -0.02$, respectively. The clearly open eyes in both cases demonstrate proper delay-line tracking and accurate FIR functioning.

Fig. 16. (a) Chip micrographs. (b) Testing setup, with instrument labels: spectrum analyzer (FSUP), 40-Gb/s oscilloscope (86100C), and 10-Gb/s BERT (N4903A).

Fig. 17. Measurements of the 20-GHz PLL: (a) phase noise plot; (b) the waveform (horizontal scale: 2 ps/div, vertical scale: 30 mV/div).

Fig. 18. Tx's output at 40 Gb/s (horizontal scale: 5 ps/div, vertical scale: 50 mV/div).

Fig. 19. Tx's output data eye for the extreme cases (horizontal scale: 5 ps/div, vertical scale: 50 mV/div).

In the Rx side, the equalized 40-Gb/s data is shown in Fig. 20. Here, simulation suggests the optimal coefficients are  $\alpha_{-1} =$  -0.11,  $\alpha_0 = 0.75$ , and  $\alpha_1 = -0.14$ , respectively. For a 5-cm channel, we have data jitter of 2.7 ps,rms and 15.1 ps,pp, respectively. The worst case occurs at 20 cm, where the eye is closer and the rms and peak-to-peak jitter slightly increase to 2.8 ps and 17.1 ps, respectively. Fig. 21 reveals the spectrum of the recovered quarter-rate clock (10 GHz) from CDR, presenting phase noise of -115 dBc/Hz at 1-MHz offset. The integration jitter is approximately equal to 319 fs, which matches time-domain measurement. The measurement here is carried out with a channel length of 20 cm. Fig. 22 shows the output data (full-rate and demultiplexed), for the case of 10-cm channel. The 40-Gb/s



Fig. 20. Equalized 40-Gb/s data in the Rx after equalization for (a) 5-cm, and (b) 20-cm Rogers channel (horizontal scale: 5 ps/div, vertical scale: 50 mV/div).



Fig. 21. Recovered clock at 10 GHz: (a) phase noise plot, and (b) its waveform (horizontal scale: 20 ps/div, vertical scale: 60 mV/div).



Fig. 22. Recovered data (in the Rx) for a 10-cm channel: (a) 40-Gb/s output right after equalizer (horizontal scale: 5 ps/div, vertical scale: 60 mV/div), and (b) demuxed 10-Gb/s output. (horizontal scale: 20 ps/div, vertical scale: 60 mV/div).

and 10-Gb/s recovered data present 2.7 and 4.0 ps rms jitter, respectively. Fig. 23 illustrates the worst-case output data on the Rx side. After a 20-cm channel, the 40-Gb/s data ($2^7 - 1$ PRBS) is barely open even with the help of the front-end equalizer. The CDR cleans up the jitter and ISI as expected. Note that BER $< 10^{-12}$ can still be obtained in this case. The recovered 10-Gb/s data have jitter of less than 7.4 ps,rms and 59.1 ps,pp. It should be noted that capturing Fig. 23(a) requires a 40-Gb/s data buffer, which inevitably introduces parasitic capacitance and deteriorates the data eye. The actual 40-Gb/s data handled by the flip-flops of the core circuits is much cleaner since the internal signal bandwidth is much wider. For longer data patterns, the BER degrades to some extent. As shown in Fig. 24(a), for a 40-Gb/s PRBS of length $2^{31}-1$, the transceiver achieves BER of less than $10^{-12}$ until the channel becomes longer than 17.5 cm, which corresponds to 18-dB loss at 20 GHz. Here, the BER is measured on one of the 10-Gb/s outputs. For the 20-cm channel, the BER begins to increase as the data pattern gets longer. The jitter tolerance with a $2^7-1$ PRBS pattern is plotted in Fig. 24(b). The CDR loop bandwidth is estimated to be 2.4 MHz, as expected. The measured input sensitivity is 32 mV.
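
To make the role of the Rx tap weights quoted above more concrete, the following minimal sketch applies the three simulated coefficients to a sampled pulse response and compares the residual ISI before and after equalization. The pulse response `channel_pulse`, the assumption of UI-spaced taps, and the helper names `convolve` and `isi_ratio` are illustrative only and are not taken from the measurements in this paper.

```python
# Sketch of a 3-tap FIR (feed-forward) equalizer acting on a sampled pulse response.
# ALPHA holds alpha_{-1}, alpha_0, alpha_1 as quoted in the text (simulated optimum).
ALPHA = (-0.11, 0.75, -0.14)


def convolve(a, b):
    """Plain discrete convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def isi_ratio(pulse):
    """Sum of |non-main samples| divided by |main cursor| (smaller is better)."""
    main = max(pulse, key=abs)
    residual = sum(abs(v) for v in pulse) - abs(main)
    return residual / abs(main)


# ASSUMPTION: hypothetical UI-spaced pulse response (pre-cursor, main, post-cursor).
# These values are invented for illustration, not measured on the Rogers channel.
channel_pulse = [0.15, 1.0, 0.2]

equalized = convolve(channel_pulse, ALPHA)
isi_before = isi_ratio(channel_pulse)
isi_after = isi_ratio(equalized)

print(f"ISI ratio before equalization: {isi_before:.3f}")
print(f"ISI ratio after equalization:  {isi_after:.3f}")
```

For this hypothetical pulse, the negative pre- and post-cursor taps cancel most of the neighboring samples, so the ISI ratio drops from 0.35 to roughly 0.08 at the cost of a smaller main cursor (about 0.71). Note also that the quoted coefficients satisfy $|\alpha_{-1}| + |\alpha_0| + |\alpha_1| = 1$.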

The performance of this work and other state-of-the-art TRx is summarized in Table I. Our design covers a range from 38.8 to



Fig. 23. Recovered data (in the Rx) for the worst case (20-cm channel): (a) 40-Gb/s output right after equalizer (horizontal scale: 5 ps/div, vertical scale: 60 mV/div), and (b) demuxed 10-Gb/s output (horizontal scale: 20 ps/div, vertical scale: 60 mV/div).



Fig. 24. (a) BER test. (b) Jitter tolerance.

TABLE I. Performance Summary.

|                           | [29]                                                                   | This Work                                                             |
|---------------------------|------------------------------------------------------------------------|-----------------------------------------------------------------------|
| Data Rate                 | 40 Gb/s (36.2–38.2 Gb/s)                                               | 40 Gb/s (38.8–42 Gb/s)                                                |
| EQ. Arch.                 | Linear                                                                 | Tx: 5-Tap FIR, Rx: 3-Tap FIR                                          |
| CDR Arch.                 | Half-Rate,<br>Bang-Bang PD                                             | Quarter-Rate,<br>Phase-Interpolated                                   |
| BER                       | < 10<sup>-14</sup>, 2<sup>15</sup>-1 PRBS<br>(4.6-dB Loss @ 20 GHz)    | < 10<sup>-12</sup>, 2<sup>7</sup>-1 PRBS<br>(19-dB Loss @ 20 GHz)     |
| Recovered<br>Clock Jitter | 1.77 ps,rms (1.14 GHz)                                                 | 319 fs,rms (10 GHz)                                                   |
| Recovered<br>Data Jitter  | N/A                                                                    | 7.36 ps,rms (10 Gb/s)                                                 |
| Supply                    | 1.45 V                                                                 | Tx: 1.2 V, Rx: 1.6 V*                                                 |
| Power Diss.               | Tx: 1.56 W, Rx: 2.04 W                                                 | Tx: 135 mW, Rx: 520 mW                                                |
| Chip Area                 | Tx: 1.7 × 2.2 mm<sup>2</sup><br>Rx: 1.7 × 2.9 mm<sup>2</sup>           | Tx: 0.9 × 0.7 mm<sup>2</sup><br>Rx: 1.35 × 0.85 mm<sup>2</sup>        |
| Technology                | 0.13-μm CMOS                                                           | 65-nm CMOS                                                            |

\* 1.2 V used in digital logic.

42 Gb/s, achieving BER $< 10^{-12}$ over Rogers channels of up to 20 cm while consuming a total power of 655 mW. Both the area and the power dissipation are much lower than those of [29].
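
The comparison with [29] can be put in numbers using only the entries of Table I. The short sketch below computes energy per bit and total (Tx plus Rx) chip area for both designs. It assumes the nominal 40-Gb/s rate for both transceivers, and the dictionary layout and the names `designs`, `energy_pj_per_bit`, and `area_mm2` are illustrative choices rather than anything defined in the paper.

```python
# Energy-per-bit and area comparison derived from the Table I entries.
# ASSUMPTION: both designs are evaluated at the nominal 40-Gb/s data rate.
DATA_RATE_GBPS = 40.0

designs = {
    "this_work": {
        "power_mw": 135.0 + 520.0,                  # Tx + Rx
        "area_mm2": 0.9 * 0.7 + 1.35 * 0.85,        # Tx + Rx
    },
    "ref_29": {
        "power_mw": 1560.0 + 2040.0,                # Tx + Rx (1.56 W + 2.04 W)
        "area_mm2": 1.7 * 2.2 + 1.7 * 2.9,          # Tx + Rx
    },
}

# mW divided by Gb/s equals pJ/bit (1e-3 W / 1e9 bit/s = 1e-12 J/bit).
energy_pj_per_bit = {
    name: d["power_mw"] / DATA_RATE_GBPS for name, d in designs.items()
}
area_mm2 = {name: d["area_mm2"] for name, d in designs.items()}

power_ratio = designs["ref_29"]["power_mw"] / designs["this_work"]["power_mw"]
area_ratio = area_mm2["ref_29"] / area_mm2["this_work"]

for name in designs:
    print(f"{name}: {energy_pj_per_bit[name]:.1f} pJ/bit, {area_mm2[name]:.2f} mm^2")
print(f"power ratio: {power_ratio:.1f}x, area ratio: {area_ratio:.1f}x")
```

Under these assumptions, this work corresponds to about 16.4 pJ/bit and 1.78 mm², against 90 pJ/bit and 8.67 mm² for [29], that is, roughly 5.5 times less power and 4.9 times less area.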

# V. CONCLUSION

This work presents a novel realization and calibration method for very high-speed FIR equalizers. Precise and delicate timing adjustment is implemented in advanced CMOS technology, and complete adaptation and CDR functions are also included. This work shows promising potential for future very broadband TRx design.

#### ACKNOWLEDGMENT

The authors thank the TSMC University Shuttle Program for chip fabrication.

#### REFERENCES

- [1] M. Nowell *et al.*, "Overview of Requirements and Applications for 40 Gigabit and 100 Gigabit Ethernet," Ethernet Alliance, Beaverton, OR, 2007.
- [2] C. Cole *et al.*, "100 GbE-optical LAN technologies," *IEEE Commun. Mag.*, vol. 45, no. 12, pp. 12–19, Dec. 2007.
- [3] K.-C. Wu and J. Lee, "A 2 × 25-Gb/s receiver with 2:5 DMUX for 100-Gb/s Ethernet," *IEEE J. Solid-State Circuits*, vol. 45, no. 11, pp. 2421–2432, Nov. 2010.
- [4] S. Addagatla *et al.*, "Direct network prototype leveraging light peak technology," in *Proc. Hot Interconnects 18 (HOTI 2010)*, Sep. 2010, pp. 109–112.
- [5] K. Chang *et al.*, "Clocking and circuit design for a parallel I/O on a first-generation CELL processor," in *IEEE ISSCC Dig. Tech. Papers*, 2005, pp. 526–527.
- [6] Rogers Corporation. [Online]. Available: http://www.rogerscorp.com
- [7] B. Casper *et al.*, "A 20 Gb/s forwarded clock transceiver in 90 nm CMOS," in *IEEE ISSCC Dig. Tech. Papers*, 2006, pp. 90–91.
- [8] H. Wang and J. Lee, "A 40-Gb/s transmitter with 4:1 MUX and subharmonically injection-locked CMU in 90-nm CMOS technology," in *Symp. VLSI Circuits Dig. Tech. Papers*, 2009, pp. 48–49.
- [9] T. Chalvatzis et al., "Low-voltage topologies for 40-Gb/s circuits in nanoscale CMOS," *IEEE J. Solid-State Circuits*, vol. 42, no. 7, pp. 1564–1573, Jul. 2007.
- [10] H. Wu *et al.*, "Differential 4-tap and 7-tap transverse filters in SiGe for 10 Gb/s multimode fiber optic link equalization," in *IEEE ISSCC Dig. Tech. Papers*, 2003, pp. 180–181.
- [11] A. Momtaz and M. M. Green, "An 80-mW 40-Gb/s 7-tap T/2-spaced feed-forward equalizer in 65-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 45, no. 3, pp. 629–639, Mar. 2010.
- [12] J. Lee et al., "A 75-GHz phase-locked loop in 90-nm CMOS technology," *IEEE J. Solid-State Circuits*, vol. 43, no. 6, pp. 1414–1426, Jun. 2008.
- [13] P. K. Hanumolu et al., "A wide-tracking range clock and data recovery circuit," *IEEE J. Solid-State Circuits*, vol. 43, no. 2, pp. 425–439, Feb. 2008.
- [14] T. Toifl et al., "A 72 mW 0.03 mm<sup>2</sup> inductorless 40 Gb/s CDR in 65 nm SOI CMOS," in *IEEE ISSCC Dig. Tech. Papers*, 2007, pp. 226–227.
- [15] S. Kaeriyama et al., "40 Gb/s multi-data-rate CMOS transmitter and receiver chipset with SFI-5 interface for optical transmission systems," *IEEE J. Solid-State Circuits*, vol. 44, no. 12, pp. 3568–3579, Dec. 2009.
- [16] K.-L. J. Wong *et al.*, "Edge and data adaptive equalization of serial-link transceivers," *IEEE J. Solid-State Circuits*, vol. 43, no. 9, pp. 2157–2169, Sep. 2008.
- [17] F. Spagna et al., "A 78 mW 11.8 Gb/s serial link transceiver with adaptive RX equalization and baud-rate CDR in 32 nm CMOS," in *IEEE ISSCC Dig. Tech. Papers*, 2010, pp. 366–367.
- [18] T. O. Dickson *et al.*, "A 12-Gb/s 11-mW half-rate sampled 5-tap decision feedback equalizer with current-integrating summers in 45-nm SOI CMOS technology," *IEEE J. Solid-State Circuits*, vol. 44, no. 4, pp. 1298–1305, Apr. 2009.

- [19] H. Wang and J. Lee, "A 21-Gb/s 87-mW transceiver with FFE/DFE/analog equalizer in 65-nm CMOS technology," *IEEE J. Solid-State Circuits*, vol. 45, no. 4, pp. 909–920, Apr. 2010.
- [20] J. D. H. Alexander, "Clock recovery from random binary signals," *Electron. Lett.*, vol. 11, no. 22, pp. 541–542, Oct. 1975.
- [21] A. C. Carusone and D. A. Johns, "Digital LMS adaptation of analog filters without gradient information," *IEEE Trans. Circuits Syst. II*, vol. 50, no. 9, pp. 539–552, Sep. 2003.
- [22] V. Stojanovic *et al.*, "Autonomous dual-mode (PAM2/4) serial link transceiver with adaptive equalization and data recovery," *IEEE J. Solid-State Circuits*, vol. 40, no. 4, pp. 1012–1026, Apr. 2005.
- [23] DFT Compiler, Standard Scan Synthesis. [Online]. Available: http://www.synopsys.com/Tools/Implementation/RTLSynthesis/Documents/dftcompiler ds.pdf
- [24] Astro, Advanced Physical Optimization, Placement and Routing Solution for System-on-Chip Designs. [Online]. Available: http://www.europractice.rl.ac.uk/vendors/snps astro ds.pdf
- [25] X. Wu *et al.*, "A 130 nm CMOS 6-bit full Nyquist 3 GS/s DAC," *IEEE J. Solid-State Circuits*, vol. 43, no. 11, pp. 2396–2403, Nov. 2008.
- [26] R. Kreienkamp et al., "A 10-Gb/s CMOS clock and data recovery circuit with an analog phase interpolator," *IEEE J. Solid-State Circuits*, vol. 40, no. 3, pp. 736–743, Mar. 2005.
- [27] K. Yamaguchi *et al.*, "A 2.5-GHz four-phase clock generator with scalable no-feedback-loop architecture," *IEEE J. Solid-State Circuits*, vol. 36, no. 11, pp. 1666–1672, Nov. 2001.
- [28] J. Lee and B. Razavi, "A 40-Gb/s clock and data recovery circuit in 0.18-µm CMOS technology," *IEEE J. Solid-State Circuits*, vol. 38, no. 12, pp. 2181–2190, Dec. 2003.
- [29] J.-K. Kim et al., "A fully integrated 0.13-µm CMOS 40-Gb/s serial link transceiver," *IEEE J. Solid-State Circuits*, vol. 44, no. 5, pp. 1510–1521, May 2009.



**Ming-Shuan Chen** was born in Taipei, Taiwan, in 1984. He received the B.S. degree in electrical engineering from National Tsing Hua University, Taiwan, in 2006, and the M.S. degree from the Graduate Institute of Electronics Engineering, National Taiwan University, Taiwan, in 2008. From 2009 to 2010, he was a research assistant at the Graduate Institute of Electronics Engineering, National Taiwan University, where he worked on 40-Gb/s wired transceiver circuit design. He is currently working toward the Ph.D. degree in integrated circuits and systems at the University of California at Los Angeles.

His research focuses on high-speed mixed-signal circuit design.



**Yu-Nan Shih** was born in Taichung, Taiwan, in 1986. He received the B.S. and M.S. degrees in electrical engineering from National Taiwan University in 2008 and 2011, respectively.

His research interests mainly focus on wireline backplane transceiver design.



**Chen-Lun Lin** was born in Taoyuan, Taiwan, in 1988. He received the B.S. degrees in electrical engineering and physics from National Taiwan University, Taiwan, in 2010. He is currently pursuing the M.S. degree in the Graduate Institute of Electrical Engineering, National Taiwan University, Taiwan. His research interests focus on wireline backplane transceiver design.



**Hao-Wei Hung** was born in Taipei, Taiwan, in 1988. He received the B.S. degree in electrical engineering from National Taiwan University, Taiwan, in 2010. He is currently pursuing the M.S. degree in the Graduate Institute of Electrical Engineering, National Taiwan University, Taiwan.

His research interests include phase-locked loops and wireline transceivers for broadband data communication.



**Jri Lee** (S'03–M'04) received the B.Sc. degree in electrical engineering from National Taiwan University (NTU), Taipei, Taiwan, in 1995, and the M.S. and Ph.D. degrees in electrical engineering from the University of California at Los Angeles (UCLA), both in 2003.

After two years of military service (1995–1997), he was with Academia Sinica, Taipei, Taiwan, from 1997 to 1998, and subsequently Intel Corporation from 2000 to 2002. He joined National Taiwan University (NTU) in 2004, where he is currently a Professor of electrical engineering. His current research interests include high-speed wireless and wireline transceivers, phase-locked loops and applications, and mm-wave circuits.

Prof. Lee received the Beatrice Winner Award for Editorial Excellence at the 2007 ISSCC, the Takuo Sugano Award for Outstanding Far-East Paper at the 2008 ISSCC, the Best Technical Paper Award from the Y. Z. Hsu Memorial Foundation in 2008, the T. Y. Wu Memorial Award from the National Science Council (NSC) of Taiwan in 2008, the Young Scientist Research Award from Academia Sinica in 2009, and the Outstanding Young Electrical Engineer Award in 2009. He also received the NTU Outstanding Teaching Award in 2007, 2008, and 2009. He has served on the Technical Program Committees of the IEEE International Solid-State Circuits Science 2008, and the Asian Solid-State Circuits Conference (A-SSCC) since 2005. He was a guest editor of the IEEE JOURNAL OF SOLID-STATE CIRCUITS in 2008. He is currently a Distinguished Lecturer of the IEEE Solid-State Circuits Society (SSCS).