# A 225 MHz Resonant Clocked ASIC Chip

Conrad H. Ziesler, Joohee Kim, Visvesh S. Sathe, Marios C. Papaefthymiou

Advanced Computer Architecture Laboratory, Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA

{ cziesler, jooheek, vssathe, marios } @eecs.umich.edu

# ABSTRACT

We have recently designed, fabricated, and successfully tested an experimental chip that validates a novel method for reducing clock dissipation through energy recovery. Our approach includes a single-phase sinusoidal clock signal, an L-C resonant sinusoidal clock generator, and an energy recovering flip-flop. Our chip comprises a dual-mode ASIC with two independent clock systems, one conventional and one energy recovering, and was fabricated in a $0.25 \mu m$ bulk CMOS process. The ASIC computes a pipelined discrete wavelet transform with self-test and contains over 3500 gates. We have verified correct functionality and obtained power measurements in both modes of operation for frequencies up to 225MHz. In the energy recovering mode, our power measurements account for all of the dissipation factors, including the operation of the integrated resonant clock generator, and show a net energy savings over the conventional mode of operation. For example, at 115MHz, measured dissipation is between 60% and 75% of the conventional mode, depending on primary input activity. To our knowledge, this is the first ever published account of a direct experimentally measured comparison between a complete energy recovering ASIC chip and its conventional implementation correctly operating in silicon at frequencies exceeding 100MHz.

### **Categories and Subject Descriptors**

B.0 [Hardware]: General

# Keywords

adiabatic logic, clock generator, CMOS, low energy, resonant LC tank, single phase, VLSI, flip-flop

### 1. INTRODUCTION

A popular approach to low-energy, high-throughput VLSI system design is voltage-scaled static CMOS with aggressive pipelining. In these systems, due to the large number of flip-flops and loading of the clock tree, the dissipation of the clock tree and state elements (flip-flops) can often be a substantial fraction of total system dissipation. Clock gating is an effective, though design-intensive, approach to reducing the dissipation of idle flip-flops and branches of the clock tree.

Copyright 2003 ACM 1-58113-682-X/03/0008 ...\$5.00.

We recently proposed a novel design methodology for energy recovery utilizing a new PMOS energy recovering flip-flop (pTERF) and a novel single-phase resonant clock generator [1]. In this paper, we describe the design, implementation, and testing of a dual-mode (energy recovering and conventional) ASIC chip that validates this methodology. We also describe our experimental methods for testing the chip and obtaining power measurements. The chip has been fabricated and successfully tested at speeds of 115MHz and 225MHz.



Figure 1: Die-photo of test chip

Figure 1 shows a microphotograph of the entire chip. In the lower left corner is our dual-mode energy recovering experiment, consisting of a dual-mode ASIC core, a resonant clock generator, and some testing logic. To obtain a dual-mode system, we have replaced each state element in a synthesized ASIC design with a pair of flip-flops, one conventional and one energy recovering, joined with a mode-select multiplexer. The conventional flip-flops are driven with a conventional clock tree. The energy recovering flip-flops are driven by a resonant sinusoidal clock. This dual-mode approach allows for direct energy and functionality comparisons, by holding nearly all of the design variables (such as wire length) constant between the two designs. Thus, our experimental characterization of each mode of operation reflects the differences solely in the clock system.

We fabricated a test chip containing our dual-mode (conventional and energy recovering) ASIC in a $0.25\mu$m logic process. Measurement results are promising. At 115MHz, conservative measurements of energy dissipation in energy-recovery mode are 60% and 75% of the dissipation in conventional mode during low and high switching activities, respectively. Furthermore, the measured minimum supply voltage for correct operation in the energy recovery mode is either the same or less than the minimum correct supply voltage for the conventional mode, indicating that the energy recovering flip-flop is at least as fast and/or as skew tolerant as its conventional counterpart.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ISLPED'03, August 25-27, 2003, Seoul, Korea.

While several energy recovering (a.k.a. adiabatic) circuits and clocking schemes have been proposed in the literature (a few representative examples are in [2, 3] and [4]), very few of these systems have successfully demonstrated working silicon with integrated power-clock generation. In addition, the few chips that did work either were targeted at low frequencies or were only demonstrated at high speed in simulation. We believe our success at demonstrating high-speed, efficient operation with integrated power-clock generation is due to our use of a single-phase sinusoidal power-clock combined with targeted use of energy recovery.

Our single-phase sinusoidal power-clock approach has several advantages over other energy recovering methodologies, including simple clock generation and distribution, no need for phase-balancing, skew tolerance, single inductor tuning, and low transistor count. Effectively, our energy recovering technique enables a substantial energy reduction with minimal designer effort. Much of our success can be attributed to the combined approach we used, where our resonant clock generator, clock distribution network, and energy-recovering flip-flop were designed together as a coherent system.

The remainder of this paper is organized as follows: Section 2 describes our dual-mode self-testing discrete wavelet ASIC design. Section 3 describes the conventional and energy recovering flip-flops we used for dual-mode operation. Section 4 describes the conventional and energy recovering clock generation and distribution networks. Section 5 describes our experimental characterization of our test ASIC. We describe ongoing work in Section 6.

### 2. DUAL-MODE ASIC CHIP



Figure 2: Arithmetic pipeline for discrete wavelet transform

To evaluate our new energy recovery ASIC design methodology, we have implemented a discrete wavelet transform datapath to be used as the first stage in a neural signal processing chip. For the purposes of our energy-recovering experiment, the datapath was driven by self-test logic rather than a neural signal. As shown in Figure 2, the discrete wavelet datapath consists of two pipelined multipliers, several pipelined adders, a FIFO, and some control circuits. The datapath bitwidth is 7 bits, with a total of 3897 gates. We targeted a $0.25\mu$m logic process available through MOSIS, using a simple cell library that we developed, which included our energy recovering flip-flop. The voltage was aggressively scaled down to 1.5V while meeting a 333MHz throughput goal.

Figure 3 shows a conceptual schematic of our dual-mode chip.



Figure 3: Dual-mode clocking system overview.

To obtain a dual-mode system, each flip-flop after synthesis is replaced with a pair of flip-flops, one conventional and one energy recovering, joined with a mode-select multiplexer. The conventional flip-flops are driven with a conventional clock tree, as indicated by the chain of buffers. The energy recovering flip-flops are driven by a resonant sinusoidal clock which is pumped by an NMOS switch. Both clock systems derive their timebase from the same input clock source. By changing a global *mode\_select* signal, the design can be switched (during reset) between conventional and energy recovering state elements, thus enabling direct energy consumption comparisons. This dual-mode approach allows for direct energy and functionality comparisons by holding nearly all of the design variables (such as wire length) constant between the two designs. Thus our experimental characterization of each mode of operation reflects the differences solely in the clock system.

The entire self-testing datapath was synthesized in a bottom-up hierarchical strategy using Synopsys DC\_SHELL. After substituting the dual-mode flip-flop and multiplexer construct, the structural netlist was placed with the Cadence QPLACE tool. A script then placed Vdd, Ground, and Power-Clock distribution wires in repeated stripes on the top metal layer over the entire core, and modified the power, ground, and power-clock nets to use the nearest metal stripe. The Cadence WROUTE tool then routed the final netlist, connecting the global nets to the nearest stripe.

Besides the discrete wavelet core, there were several other components on the test chip. A single-mode energy recovering ASIC core shared the same power bus as the dual-mode core. Two clock generators, each with an independent enable signal, were placed in parallel to allow for a wide range of operating frequencies and inductor Q's. Four bond pads were devoted to each of the core Power, Ground, and Power-clock signals to provide a low impedance connection to off chip. In addition, the chip contained a ring-oscillator and timing control unit (each with separate isolated supplies) as well as a system-wide testing block containing the main scan-chain interface.

### **3. STATE ELEMENTS**

In this section we describe the state elements we used in our experimental comparison. To simplify the experiment, we used only a single type and size of flip-flop for both the conventional and energy recovering mode. Both flip-flops were sized to operate at up to 500MHz in simulation.

#### 3.1 Conventional



Figure 4: Schematic of conventional flip-flop used for comparisons.

A schematic of the conventional flip-flop we used is shown in Figure 4. This is a standard textbook design using combined passgate inverters. The flip-flop has an area of  $128 \ \mu m^2$  and a typical D-Q delay of 950ps at 1.5V. The energy consumption (in simulation) is 132fJ/cycle when D is changing and 55fJ/cycle when D is held constant, not counting the clock tree.

While this is by no means the lowest power or fastest conventional static CMOS flip-flop, it does serve as a good baseline for comparison purposes and scales well to low voltages. Due to its simplicity, it is easier to obtain correct operation than with many of the more aggressive flip-flops proposed in the literature [5, 6, 7, 8]. Notice that this flip-flop (as well as many other static CMOS flip-flops) internally inverts the clock signal, causing some dissipation even if the input D is constant.

#### 3.2 Energy Recovering

A schematic of the energy recovering flip-flop we used in our experiments is shown in Figure 5. The flip-flop consists of an energy recovering dynamic buffer driving a pair of cross-coupled NOR gates as the static latch element. We include one internal inverter on D to derive a complemented input and one inverter on the output to increase the drive strength (although it is not strictly necessary).

Figure 5: Schematic of PMOS energy recovering flip-flop (pTERF).

The power-clock node  $P_{clk}$  supplies both power and timing information to the circuit, in contrast to conventional clock nodes which supply only timing information. Our flip-flop latches on rising pulses of  $P_{clk}$ . The input needs to be stable by the time  $P_{clk}$  is roughly half way to its peak, and should be held stable until  $P_{clk}$  is at the peak. Note that correct operation is dependent on the ratioing of the pull-down NMOS and the pull-up cross coupled PMOS. Dissipation is lowest when used with a slightly longer hold time, although correct operation is obtained even with short hold times.



Figure 6: Operational waveforms for pTERF.

Key properties of our flip-flop include near-zero dissipation when the input data is held constant, low overall dissipation when the input is changing, low-voltage operation, compact layout, and a D-Q delay which is inversely proportional to frequency. These properties are derived from the internal energy recovering dynamic buffer. As seen in the operation waveforms in Figure 6, the internal dynamic nodes xt and xf follow  $P_{clk}$  in a smooth, slow transition, whenever the input d is held constant.

As shown in Figure 6, the operation of pTERF begins with the data input changing at a suitable time before the rising edge of $P_{clk}$. The cross coupled PMOS devices sense and latch the appropriate value of D onto the nodes xt and xf. Since the cross coupled NOR gates form a simple set/reset latch, a positive pulse on xt sets the latch and a positive pulse on xf resets it. When D is not changing, either xt or xf will remain low, with the other node oscillating in phase with $P_{clk}$ in an energy recovering manner, that is, transferring charge to/from the $P_{clk}$ signal. This charge recycling probe operation is the key to the ultra low energy consumption at zero input switching activity.
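The latching behavior described above can be sketched as a purely logical, cycle-level model. This is an illustrative sketch only: it ignores all analog behavior and energy, and it assumes that xt is the node that pulses when D is high (the text does not state the polarity explicitly). The function names are hypothetical.

```python
def sr_latch(set_pulse, reset_pulse, q_prev):
    """Cross-coupled NOR set/reset latch: a pulse on one input changes the state."""
    if set_pulse and not reset_pulse:
        return 1
    if reset_pulse and not set_pulse:
        return 0
    return q_prev  # no pulse on either input: hold the previous state

def pterf_cycle(d, q_prev):
    """One P_clk pulse. Returns (xt_pulses, xf_pulses, q).

    Assumption: xt follows P_clk when D is high, xf when D is low;
    the other dynamic node stays low for the whole cycle.
    """
    xt_pulses = bool(d)
    xf_pulses = not xt_pulses
    q = sr_latch(xt_pulses, xf_pulses, q_prev)
    return xt_pulses, xf_pulses, q

def run_pterf(d_sequence, q0=0):
    """Apply one input value per clock cycle and record Q after each cycle."""
    q = q0
    trace = []
    for d in d_sequence:
        xt, xf, q = pterf_cycle(d, q)
        trace.append((xt, xf, q))
    return trace

trace = run_pterf([1, 1, 0, 0, 1])
print([q for _, _, q in trace])
```

In this model exactly one of the two dynamic nodes pulses per cycle, and when D repeats, the same node pulses again, which is the case the text identifies as energy recovering.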

Figure 7 plots the total flip-flop delay (setup time plus clock to output delay) as a function of operating frequency. At 200MHz and 1.0V, pTERF requires 1,280ps, while at 500MHz and 1.5V, it requires 570ps. At 500MHz, 1.8V (not shown on the graph), pTERF requires only 460ps. Notice the trend that flip-flop delay decreases with increasing frequency. This behavior is due to the sinusoidal shape of the energy recovering  $P_{clk}$  waveform. At all frequencies, pTERF consumes roughly a quarter of the total clock period. Increasing the voltage reduces this fraction slightly.
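As a quick check of the "roughly a quarter of the clock period" observation, the following sketch divides each quoted delay by the corresponding clock period. The three data points are the ones given in the text; the function names are illustrative.

```python
# Delay figures quoted in the text: (frequency in MHz, supply in V, Ts+Tq in ps).
PTERF_DELAYS = [
    (200, 1.0, 1280),
    (500, 1.5, 570),
    (500, 1.8, 460),
]

def period_ps(freq_mhz):
    """Clock period in picoseconds for a frequency given in MHz."""
    return 1e6 / freq_mhz

def delay_fraction(freq_mhz, delay_ps):
    """Fraction of the clock period consumed by the flip-flop delay."""
    return delay_ps / period_ps(freq_mhz)

fractions = [delay_fraction(f, d) for f, _, d in PTERF_DELAYS]
for (f, v, d), frac in zip(PTERF_DELAYS, fractions):
    print(f"{f} MHz, {v} V: {d} ps = {frac:.1%} of the period")
```

The three fractions come out between about 23% and 29%, and the higher-voltage point at 500MHz gives the smallest fraction.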

Figure 8 shows the energy consumption (in simulation) of pTERF as a function of operating frequency for two different input data conditions, idle (never switching) and active (always switching). The idle energy consumption is near zero at all frequencies, with a dissipation of 1.7fJ at 200MHz and 4.89fJ at 500MHz. The active energy consumption was measured both at the minimum hold time (for the case of cascaded flip-flops), and at a nominal hold time (for the case of logic between flip-flops). For all frequencies, the lower point on the error-bar indicates the energy dissipation at the nominal hold time.

Figure 7: pTERF D-Q delay (Ts+Tq)

Figure 8: Active and idle energy consumption of pTERF as function of frequency.

### 4. CLOCK TREE

#### 4.1 Conventional



Figure 9: Partial representation of conventional clock tree

The conventional mode clock tree consists of a depth 4 tree of inverters whose sizes increase by a factor of 3 each stage. For each cycle in the conventional clock system, total dissipation amounts to 76fJ per flip-flop, including internal flip-flop clock nodes. This dissipation level is obtained for a relatively small clock tree of only 387 flip-flops. For larger systems, the overhead should grow as the depth of the tree and the length of the interconnect increases.
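To put the per-flip-flop figure in context, the sketch below scales it to the whole clock system and to a given clock frequency. The 76fJ, 387 flip-flops, depth, and size ratio come from the text; the unit-size first stage and the helper names are assumptions made for illustration, and the result is an estimate derived from the quoted per-flip-flop figure, not a measured value.

```python
CLOCK_ENERGY_PER_FF_FJ = 76.0  # per cycle, from the text (includes internal clock nodes)
NUM_FLIP_FLOPS = 387           # from the text
TREE_DEPTH = 4                 # from the text
STAGE_RATIO = 3                # from the text

def stage_sizes(depth=TREE_DEPTH, ratio=STAGE_RATIO):
    """Relative inverter size per stage, assuming a unit-size first stage."""
    return [ratio ** i for i in range(depth)]

def clock_energy_per_cycle_pj(n_ff=NUM_FLIP_FLOPS, e_ff_fj=CLOCK_ENERGY_PER_FF_FJ):
    """Total conventional clock energy per cycle in picojoules."""
    return n_ff * e_ff_fj / 1000.0

def clock_power_mw(freq_mhz):
    """Clock power in milliwatts: pJ per cycle times MHz gives microwatts."""
    return clock_energy_per_cycle_pj() * freq_mhz / 1000.0

print(stage_sizes())
print(f"{clock_energy_per_cycle_pj():.3f} pJ per cycle")
for f in (115, 225):
    print(f"{f} MHz: {clock_power_mw(f):.2f} mW")
```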

#### 4.2 Energy Recovering

The energy recovering clock tree consists of a single-wire power-clock that is forked to each flip-flop in the system. Any capacitance on this wire is resonated with a lumped inductor, and thus dissipates very little energy. This resonant system is driven by the power-clock generator that converts energy from the DC supplies into AC energy using a lumped inductor and an on-chip NMOS switch.



Figure 10: Block diagram of power-clock generator

The power-clock generator is composed of a control circuit, a large NMOS power transistor and associated drive circuitry, and a lumped inductor connected to a DC supply which is half that of the logic Vdd supply, as shown in Figure 10. Synchronization occurs with an input reference square-wave clock, which is fed into three delay lines that generate the appropriate timing signals for the control circuit. The controller compares the peak value of power-clock with a reference voltage. For each cycle, it decides whether or not the inductor current needs replenishing. The output of the controller is a pulse which is buffered and inverted by two ratioed inverters before connecting to the gates of a PMOS pull-up and an NMOS pull-down. These two devices drive the gate terminal of the main NMOS power switch. The sizes of the transistors in the ratioed inverters are chosen so that the pull-up and pull-down are never on at the same time. In addition, the main switch is turned on slowly, but turned off quickly, thus minimizing dissipation due to the PMOS pull-up capacitance. The main NMOS power switch is turned on at the time when the voltage difference between $P_{clk}$ and ground is small, replenishing the current in the inductor from the DC supply.

The single-cycle controller is built around a two-stage clocked comparator circuit connected to a set-reset latch, as shown in Figure 11. A low-to-high transition on d3 causes the difference between  $P_{clk}$  and the reference voltage to be amplified by the cross coupled inverters. The result of this comparison toggles the set-reset latch. The phase difference between d1 and d2 is used to generate a pulse which is gated by the current state of the set-reset latch and fed to the output. Thus the controller efficiently implements single-cycle feedback control.

Figure 12 shows typical operational waveforms of the clock generator. The g signal is the gate drive of the main power switch. The pc signal is the power-clock signal. At 100ns, the load connected to pc undergoes a step increase in dissipation. As a result, the clock generator, which was operating under a 2-on, 2-off periodicity, switches behavior to full-on.

Figure 11: Single-cycle controller

Figure 12: Power-clock generator operation
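The per-cycle decision made by the controller can be illustrated with a toy behavioral model. This is a sketch under strong assumptions, not the circuit itself: the tank state is a single number in arbitrary units, the load removes a fixed amount per cycle, and each pump pulse adds a fixed amount. All parameter names and values are hypothetical, and the model is not intended to reproduce the exact on/off pattern of Figure 12.

```python
def simulate_controller(loss_per_cycle, n_cycles,
                        pump_energy=1.0, reference=10.0, start=10.0):
    """Return a per-cycle list of pump decisions (True = switch pulsed).

    Each cycle the comparator checks the tank level against the reference;
    if it is below, the power switch is pulsed to replenish the inductor.
    The load then removes `loss_per_cycle` units. Units are arbitrary.
    """
    level = start
    pulses = []
    for _ in range(n_cycles):
        fire = level < reference  # single-cycle comparator decision
        if fire:
            level += pump_energy  # replenish from the DC supply
        level -= loss_per_cycle   # dissipation in the load this cycle
        pulses.append(fire)
    return pulses

light = simulate_controller(loss_per_cycle=0.5, n_cycles=8)
heavy = simulate_controller(loss_per_cycle=1.0, n_cycles=8)
print("light load:", light)
print("heavy load:", heavy)
```

With the light load the model pulses on alternate cycles, and with the heavier load it pulses on every cycle after the first, mirroring the qualitative shift from intermittent pumping to full-on operation when the load dissipation steps up.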

### 5. EXPERIMENTAL RESULTS

In this section we describe our experimental characterization of our test ASIC. Our chip was fabricated in a  $0.25\mu$ m process through MOSIS using the SCN5M\_DEEP.12 scalable design rules. The chip was packaged in a Kyocera PGA108M standard 108-pin ceramic package, using a single row of bondpads distributed evenly around the periphery of a 3.1 mm by 3.1 mm die. The package cavity was 8.9 by 8.9 mm. Thus the average bondwire length was roughly 2.9 mm. Typical package parasitics include a 4–6 nH series inductance, a 0.2–0.4 Ohm series resistance, and a 1–3pF parallel capacitance. In addition, due to the manufacturing plating bus, there was a 0.5–5 nH parallel inductance in series with a 0.2–1.5 pF capacitance and a 0–0.3 Ohm resistance.

For the energy recovering mode, our power measurements include all of the dissipation of the power-clock generator, power-clock DC bias, and ASIC core. In the conventional mode, our measurements include all of the dissipation in the ASIC core, including the clock distribution buffers. In either case, dissipation does not include I/O supply, testing circuits, or ring oscillator/timing control unit. All of the dissipation results are computed from $I_{dc} \cdot V_{dc}$ measured at each supply input to the chip. Any AC-currents present were decoupled with bypass capacitors and RF-choke coils. Currents were measured as $I = \Delta V/10\Omega$, that is, the voltage difference across a $10\Omega \pm 1\%$ thin-film surface-mount resistor. The decoupling and current measuring circuit is shown in Figure 13.

Figure 13: Power measurement circuit used for each independent DC supply
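The measurement arithmetic described above can be written out as a short sketch. The shunt value and tolerance come from the text; the example readings are hypothetical, and the sketch assumes $V_{dc}$ is the supply voltage seen at the chip side of the shunt resistor.

```python
SHUNT_OHMS = 10.0       # thin-film sense resistor, from the text
SHUNT_TOLERANCE = 0.01  # +/-1 %, from the text

def supply_current_a(delta_v):
    """DC supply current from the voltage drop across the shunt, I = dV / R."""
    return delta_v / SHUNT_OHMS

def supply_power_w(delta_v, v_dc):
    """DC power of one supply, P = I_dc * V_dc."""
    return supply_current_a(delta_v) * v_dc

def total_power_w(readings):
    """Sum of power over all supplies; readings is a list of (delta_v, v_dc)."""
    return sum(supply_power_w(dv, v) for dv, v in readings)

def power_bounds_w(delta_v, v_dc, tol=SHUNT_TOLERANCE):
    """Lower and upper power bounds implied by the resistor tolerance alone."""
    nominal = supply_power_w(delta_v, v_dc)
    return nominal / (1.0 + tol), nominal / (1.0 - tol)

# Hypothetical readings: 50 mV drop on a 1.5 V supply, 20 mV drop on a 0.75 V supply.
readings = [(0.05, 1.5), (0.02, 0.75)]
print(f"total: {total_power_w(readings) * 1e3:.2f} mW")
print(power_bounds_w(0.05, 1.5))
```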



Figure 14: Comparison of measured signature output at 225MHz for conventional (top trace) and energy recovery (bottom trace). 200,000 points were sampled at 200ps intervals with a Tektronix TDS 7404 Digital Oscilloscope.

Verification of correct circuit operation was done using the programmable self-test operational mode of the ASIC module. A standard signature analyzer circuit based on a linear-feedback shift register produced a serial bitstream, a sequence of which was matched against a programmable expected result register with successful matches toggling a single output bit. We thus derived a low-frequency periodic output from the high-frequency signature pattern, suitable for full-speed chip testing. Figure 14 shows a comparison of captured data between energy recovering and conventional modes. The data (200,000 sampled points at 200ps/point) was captured using a Tektronix TDS 7404 Digital Oscilloscope, and plotted using Matlab. These waveforms also match the expected results from Verilog simulations.
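The idea of turning a fast signature bitstream into a slow periodic output can be sketched as follows. This is an illustration only: the register width, feedback polynomial, and seed used on the chip are not given in the text, so a 4-bit autonomous LFSR with polynomial $x^4 + x + 1$ stands in for the real signature analyzer, which would also compact datapath outputs.

```python
def lfsr_bitstream(seed, n_bits):
    """Serial output of a 4-bit Fibonacci LFSR (assumed polynomial x^4 + x + 1)."""
    state = seed & 0xF
    bits = []
    for _ in range(n_bits):
        bits.append(state & 1)                 # serial output bit
        feedback = (state ^ (state >> 1)) & 1  # XOR of bits 0 and 1
        state = (state >> 1) | (feedback << 3)
    return bits

def match_toggle(bits, expected, width):
    """Toggle a single output bit whenever the last `width` bits equal `expected`."""
    mask = (1 << width) - 1
    window, out, trace = 0, 0, []
    for i, bit in enumerate(bits):
        window = ((window << 1) | bit) & mask
        if i >= width - 1 and window == expected:
            out ^= 1
        trace.append(out)
    return trace

bits = lfsr_bitstream(seed=0b0001, n_bits=60)
trace = match_toggle(bits, expected=0b1000, width=4)
toggles = [i for i in range(1, len(trace)) if trace[i] != trace[i - 1]]
print(toggles)
```

Because this stand-in bitstream repeats every 15 cycles, the expected pattern matches once per repetition and the output bit toggles at a fraction of the clock rate, which is what makes the result observable on a slow output pin.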

The energy recovery mode necessitates tuning the clock frequency to match the resonant frequency of the LC system. Experimentally, this tuning was accomplished by sweeping the clock frequency while observing an output pin from the chip that is driven with a chain of inverters that buffer and amplify the internal sinusoidal power-clock signal. The output signal is nearly a square wave after passing through the inverter chain needed to amplify the signal to drive the output pad. When the frequency and duty cycle are tuned correctly, a stable periodic signal is observed on the buffered power-clock signal.

Figure 15 gives a screen dump from the oscilloscope showing the self-test output signal and the buffered power-clock signal. The top half of the screen is the expanded time view, while the bottom half is the zoomed-in view showing the roughly 4.3ns clock period.

Figure 15: Screen dump of signature output and a buffered copy of power-clock for 225MHz operation. Top half is view with timescale of $4\mu$s/division, bottom half is view at 8ns/division.

| Frequency (resonant inductance) | Mode            | Self-test | Reset  |
|---------------------------------|-----------------|-----------|--------|
| 115MHz (external L=5nH)         | Energy Recovery | 8.9mW     | 2.6mW  |
| 115MHz (external L=5nH)         | Conventional    | 11.7mW    | 4.3mW  |
| 225MHz (parasitic only L)       | Energy Recovery | 29.5mW    | 15.0mW |
| 225MHz (parasitic only L)       | Conventional    | 35.4mW    | 13.8mW |

Table 1: Measured power dissipation

Table 1 gives our primary experimental measurements. We measured the energy of the entire system, excluding I/O drivers and testing circuits. At 115MHz the energy recovering mode dissipates around 60% of the conventional mode while in reset (i.e., nearly zero switching activity). During self-test (i.e., high switching activity), the energy recovering mode dissipates around 75% that of the conventional mode. The measurements at 115MHz were performed with two parallel 10nH surface mount inductors soldered to the printed circuit board adjacent to the PGA socket connection.
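The percentages quoted from Table 1 can be recomputed directly from the measured power values. The sketch below uses only the numbers in the table; the dictionary layout and function names are illustrative.

```python
# Measured power in mW, copied from Table 1.
TABLE_1_MW = {
    115: {"energy_recovery": {"self_test": 8.9, "reset": 2.6},
          "conventional": {"self_test": 11.7, "reset": 4.3}},
    225: {"energy_recovery": {"self_test": 29.5, "reset": 15.0},
          "conventional": {"self_test": 35.4, "reset": 13.8}},
}

def dissipation_ratio(freq_mhz, activity):
    """Energy-recovery power divided by conventional power at one operating point."""
    row = TABLE_1_MW[freq_mhz]
    return row["energy_recovery"][activity] / row["conventional"][activity]

def energy_per_cycle_pj(power_mw, freq_mhz):
    """Energy per clock cycle in picojoules (mW / MHz = nJ)."""
    return power_mw / freq_mhz * 1000.0

for freq in (115, 225):
    for activity in ("reset", "self_test"):
        print(f"{freq} MHz, {activity}: {dissipation_ratio(freq, activity):.1%}")
```

Since both modes run at the same frequency at each operating point, the power ratio equals the per-cycle energy ratio.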

In comparison, the measurements at 225MHz were performed with no external inductance, using solely the package and bondwire parasitic inductances. Unfortunately, this parasitic inductance is somewhat lossy, yielding power dissipation that is higher than expected from simply scaling up the power from the 115MHz point. These losses are reflected by the larger than expected DC current drawn from the  $P_{clk}$  half-Vdd bias supply. We thus attribute the extra dissipation to the parasitic losses from utilizing the package parasitic inductance as the resonant element. It is likely that these losses are due to eddy currents being induced in adjacent metalization in the package. Thus for high-speed resonant clocking, we conclude that custom designed packages or on-chip inductors may be necessary.

From the 115MHz and 225MHz measurements, we can estimate the equivalent L and C for the two experiments. Assuming C is constant between the two frequencies (slight changes in the PCB were necessary), we compute an effective C of 283pF along with roughly 1.8nH of parasitic inductance within the package and bondwire structure. This is comparable to the expected 4–6nH/pin divided by 4 pins in parallel, or roughly 1–1.5nH from the MOSIS supplied package characterization.
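The estimate of L and C can be reproduced from the two resonant frequencies with $f = 1/(2\pi\sqrt{LC})$. The sketch below assumes, in addition to the constant C stated in the text, that the external 5nH inductance simply adds in series with the package parasitic inductance; that series model is an assumption of this sketch, chosen because it is consistent with the quoted numbers.

```python
import math

def estimate_parasitic_l_and_c(f_low_hz, f_high_hz, l_ext_h):
    """Solve for parasitic L and tank C from two resonance points.

    Assumed model: at f_high only the parasitic inductance resonates with C;
    at f_low the external inductance is added in series, with the same C.
    """
    k = (f_high_hz / f_low_hz) ** 2  # equals (L_ext + L_par) / L_par
    l_par = l_ext_h / (k - 1.0)
    c = 1.0 / ((2.0 * math.pi * f_high_hz) ** 2 * l_par)
    return l_par, c

l_par, c = estimate_parasitic_l_and_c(115e6, 225e6, 5e-9)
print(f"L_par = {l_par * 1e9:.2f} nH, C = {c * 1e12:.0f} pF")
```

Under these assumptions the result is close to the 1.8nH and 283pF quoted in the text.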

### 6. CONCLUSION

We have presented a comprehensive experiment directly comparing an energy recovering ASIC chip working correctly at 115MHz and 225MHz with the same chip operating in conventional mode as a control. The energy recovering ASIC chip implements a nontrivial pipelined datapath with energy recovering flip-flops, a single-phase sinusoidal clock tree with wire segments of nearly one millimeter, and a resonant clock generator with single-cycle feedback control. Industry standard automated design tools were used. Direct energy comparisons were made from power measurements with favorable results, 60% to 75% of conventional dissipation at 115MHz. At 225MHz, energy consumption was 108% and 83% that of conventional, limited primarily by lossy inductance formed by the wires within the package.

The main contribution of our work is our silicon-validated, targeted application of energy recovery to specific high-dissipation loads on the chip, namely the clock distribution network. By designing a resonant clock generator, single-phase clock distribution network, and energy recovering flip-flop together as a coherent system, we have successfully demonstrated efficient high-speed working chips. Our experimental energy measurements provide a basis for optimizing the design of the resonant inductance as a future step in achieving efficient, high-speed energy-recovering VLSI systems.

### 7. ACKNOWLEDGMENTS

This research was supported in part by the US Army Research Office under AASERT Grant No. DAAG55-97-1-0250 and Grant No. DAAD19-99-1-0304.

### 8. **REFERENCES**

- [1] C. H. Ziesler, J. Kim, and M. C. Papaefthymiou, "Energy recovering ASIC design," in *Proceedings of International Symposium on VLSI*, Feb. 2003.
- [2] W. C. Athas, L. J. Svensson, J. G. Koller, N. Tzartzanis, and Y. Chou, "Low-power digital systems based on adiabatic-switching principles," *IEEE Transactions on VLSI Systems*, vol. 2, no. 4, pp. 398–406, Dec. 1994.
- [3] D. Maksimovic, V. G. Oklobdzija, B. Nikolic, and K. W. Current, "Clocked CMOS adiabatic logic with integrated single-phase power-clock supply," *IEEE Transactions on VLSI Systems*, vol. 8, no. 4, pp. 460–463, Aug. 2000.
- [4] W. Athas, N. Tzartzanis, W. Mao, L. Peterson, R. Lal, K. Chong, J.S. Moon, L. Svensson, and M. Bolotski, "The design and implementation of a low-power clock-powered microprocessor," *IEEE Journal of Solid-State Circuits*, vol. 35, no. 11, pp. 1561–1570, Nov. 2000.
- [5] B.S. Kong, S.S. Kim, and Y.H. Jun, "Conditional-capture flip-flop for statistical power reduction," *IEEE Journal of Solid-State Circuits*, vol. 36, no. 8, pp. 1263–1271, Aug. 2001.
- [6] C. Kim and S.M. Kang, "A low-swing clock double-edge triggered flip-flop," in *Symposium on VLSI Circuits*, 2001, pp. 183–186.
- [7] V. Stojanovic and V. G. Oklobdzija, "Comparative analysis of master-slave latches and flip-flops for high-performance and low-power systems," *IEEE Journal of Solid-State Circuits*, vol. SC-34, no. 4, pp. 536–548, Apr. 1999.
- [8] J. Yuan and C. Svensson, "New single-clock CMOS latches and flipflops with improved speed and power savings," *IEEE Journal of Solid-State Circuits*, vol. SC-32, no. 1, pp. 62–69, Jan. 1997.