

# An ultra-low-energy/frame multi-standard JPEG co-processor in 65nm CMOS with sub/near-threshold power supply.

*Citation for published version (APA):* Pu, Y., Pineda de Gyvez, J., Corporaal, H., & Ha, Y. (2009). An ultra-low-energy/frame multi-standard JPEG coprocessor in 65nm CMOS with sub/near-threshold power supply. In *Proceedings of the IEEE International Solid-State Circuits Conference 2009, ISSCC 2009, 8-12 February 2009, San Francisco, CA, USA* (pp. 146-147a). Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/ISSCC.2009.4977350

DOI: 10.1109/ISSCC.2009.4977350

### Document status and date:

Published: 01/01/2009

#### Document Version:

Publisher's PDF, also known as Version of Record (includes final page, issue and volume numbers)


## ISSCC 2009 / SESSION 8 / MULTIMEDIA PROCESSORS / 8.1

#### 8.1 An Ultra-Low-Energy/Frame Multi-Standard JPEG Co-Processor in 65nm CMOS with Sub/Near-Threshold Power Supply

Yu Pu<sup>1,2,3</sup>, Jose Pineda de Gyvez<sup>1,2</sup>, Henk Corporaal<sup>1</sup>, Yajun Ha<sup>3</sup>

- <sup>1</sup>Eindhoven University of Technology, Netherlands
- <sup>2</sup>NXP Semiconductors, Eindhoven, Netherlands
- <sup>3</sup>National University of Singapore, Singapore

Many digital ICs can benefit from sub/near-threshold operation, which provides ultra-low energy per operation and thus long battery lifetime. In addition, sub/near-threshold operation largely mitigates transient current and hence lowers ground-bounce noise. This also helps to improve the performance of sensitive analog circuits on the chip, such as delay-locked loops (DLLs), which are crucial for the functioning of large digital circuits. However, aggressive voltage scaling degrades throughput and reliability. This paper presents SubJPEG, a state-of-the-art multi-standard 65nm CMOS JPEG encoding co-processor that enables ultra-wide $V_{DD}$ scaling. With a 0.45V power supply, it supports a 15fps 640×480 VGA application with an energy consumption of only 1.3pJ/operation for the DCT and quantization computation. This co-processor is well suited to applications such as digital cameras, portable wireless devices and medical imaging. To the best of our knowledge, this is the largest sub-threshold processor so far.
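To put the 15fps VGA target in perspective, the following back-of-the-envelope sketch counts the 8×8 blocks that must be transformed and quantized per second. The chroma subsampling options and the even split of work across engines are assumptions made here for illustration; the paper only states that the YUV sampling ratio is programmable and that four engines run in parallel.

```python
# Rough block-throughput estimate for a 15 fps 640x480 stream.
# The subsampling table and the even load split are assumptions,
# not figures taken from the paper.

BLOCK = 8  # JPEG baseline operates on 8x8 pixel blocks

# Chroma samples (U and V together) per luma sample, per assumed sampling mode.
CHROMA_FRACTION = {"4:4:4": 2.0, "4:2:2": 1.0, "4:2:0": 0.5}


def blocks_per_frame(width: int, height: int, sampling: str) -> int:
    """Number of 8x8 blocks (Y + U + V) in one frame."""
    luma_blocks = (width // BLOCK) * (height // BLOCK)
    return int(luma_blocks * (1 + CHROMA_FRACTION[sampling]))


def blocks_per_second_per_engine(width, height, sampling, fps, n_engines):
    """Block rate each engine must sustain if the load is split evenly."""
    return blocks_per_frame(width, height, sampling) * fps / n_engines


if __name__ == "__main__":
    for mode in CHROMA_FRACTION:
        total = blocks_per_frame(640, 480, mode)
        rate = blocks_per_second_per_engine(640, 480, mode, fps=15, n_engines=4)
        print(mode, total, rate)
```

Under the 4:2:0 assumption this gives 7200 blocks per frame, or 27000 blocks per second per engine, which indicates the order of magnitude each sub/near-threshold engine has to sustain.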

The architecture of SubJPEG is shown in Fig. 8.1.1. The design is fully compliant with the JPEG encoder baseline standard [1]. Asynchronous FIFOs (AFIFOs) are located at the front-end of the data-path to provide a flexible interface to standard buses such as PCI/PCI-X/PCI-Express. For each frame, the external main CPU issues a command to the configuration register file of the JPEG processor. The command includes the source data start address/length, destination data start address, YUV sampling ratio, programmable quantization table coefficients, etc. SubJPEG accommodates two command slots in the configuration register file so as to minimize the inter-frame configuration latency. The JPEG data-path has three main stages: (1) 2D-DCT transformation, (2) quantization, and (3) Huffman encoding. A pair of DCT and quantization modules is denoted as an "engine". SubJPEG has 4 engines in parallel in order to mitigate the throughput degradation. It exploits 2 supply voltage domains and 3 frequency domains. The configuration and interface logic operates with the *bus* clock and $V_{DDH}$, the Huffman encoder operates with the *fast* clock and $V_{DDH}$, while the engines operate with the *slow* clock and $V_{DDL}$. In addition, to reduce the number of I/O pads of this DMA processor, logic has been added to multiplex certain I/O signals, so an additional $V_{MUX}$ domain working under a *shift* clock is implemented. Signals crossing clock domains use handshaking to increase robustness.
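As a functional illustration of what one "engine" computes, here is a minimal software sketch of the 2D-DCT and quantization stages for a single 8×8 block. It is a plain reference model, not the hardware architecture of the chip: the function names, the flat quantization table and the rounding rule are assumptions chosen for brevity.

```python
import math

N = 8  # JPEG baseline block size


def dct_1d(samples):
    """Orthonormal 1D DCT-II of a length-N sequence."""
    out = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        acc = sum(
            samples[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
            for n in range(N)
        )
        out.append(scale * acc)
    return out


def dct_2d(block):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in block]
    # cols[u][v]: horizontal frequency u, vertical frequency v.
    cols = [dct_1d([rows[r][u] for r in range(N)]) for u in range(N)]
    return [[cols[u][v] for u in range(N)] for v in range(N)]


def quantize(coeffs, qtable):
    """Divide each coefficient by its table entry and round.

    Python's round() rounds halves to even; a hardware quantizer
    may use a different rounding rule.
    """
    return [[round(coeffs[r][c] / qtable[r][c]) for c in range(N)] for r in range(N)]


def run_engine(block, qtable):
    """Hypothetical model of one engine: level shift, 2D-DCT, quantization."""
    shifted = [[pixel - 128 for pixel in row] for row in block]
    return quantize(dct_2d(shifted), qtable)


if __name__ == "__main__":
    flat_block = [[144] * N for _ in range(N)]   # uniform grey block
    flat_table = [[16] * N for _ in range(N)]    # illustrative table only
    for row in run_engine(flat_block, flat_table):
        print(row)
```

For a uniform block only the DC coefficient is non-zero, which makes the example easy to check by hand. The quantized block would then be passed to the Huffman encoder, which is not modelled here.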

$V_T$ variation is the dominant contributor to sub-threshold current variation because the current depends exponentially on $V_T$ [2]. A novel configurable $V_T$ balancer, which uses only one bulk line to balance the $V_T$ of PMOS and NMOS transistors, is shown in Fig. 8.1.2. An off-chip capacitor is needed to mitigate ripple. Each engine has a deep N-well island and a dedicated $V_T$ balancer located at its corner. When the engines are set in super-threshold mode, the tri-state buffer is configured to be in the high-impedance state. The bulk of the PMOS transistors is then connected to $V_{DD}$, and the bulk of the NMOS transistors is connected to $G_{ND}$. When the engines are configured in sub/near-threshold mode, the tri-state buffer starts to function. Any fluctuation of the signal $V_{out}$, which is generated by a process-corner $V_T$ imbalance detector, is thus detected and amplified by the tri-state buffer. In this mode, the power switches are configured so that the buffer's output voltage directly supplies the bulk of all the logic gates of the engine. This output is also fed back to the bulk of the $V_T$ balancing detector to force PMOS/NMOS $V_T$ balancing. Since the bulk control line is never higher than $|V_T|$, the p-n junction diodes are prevented from turning on. The power switch transistors $S_0$, $S_1$ and $S_2$ are implemented as NMOS transistors with their gate control voltage boosted to 1.2V, which is the $V_{DDIO}$ for the pads, so that their driving capability is increased. Monte-Carlo transient simulation of a critical path at $V_{DD}$=0.4V shows that the standard deviation $\sigma$ is reduced by 4.7× and $\sigma/\mu$ is reduced by 3.6× when the configurable $V_T$ balancer is used.
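The two quantitative points in this paragraph can be written out explicitly. The first line below is the standard textbook sub-threshold current model, not an equation taken from the paper; the prefactor $I_0$, the slope factor $n$ and the thermal voltage $\phi_t$ are symbols introduced here for illustration. The second line combines the two reported reduction factors, assuming both refer to the same simulated distribution (primed symbols denote values with the balancer enabled).

$$
\begin{aligned}
I_{sub} &\approx I_0 \exp\!\left(\frac{V_{GS}-V_T}{n\,\phi_t}\right)
&&\Rightarrow\quad \frac{I_{sub}(V_T+\Delta V_T)}{I_{sub}(V_T)} = \exp\!\left(-\frac{\Delta V_T}{n\,\phi_t}\right) \\
\sigma' &= \frac{\sigma}{4.7},\quad \frac{\sigma'}{\mu'} = \frac{1}{3.6}\,\frac{\sigma}{\mu}
&&\Rightarrow\quad \frac{\mu'}{\mu} = \frac{3.6}{4.7} \approx 0.77
\end{aligned}
$$

Under that reading, the balancer not only tightens the spread but also lowers the mean by roughly 23%.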

Our previous research has revealed that the $V_T$ mismatch of paired transistors working in the sub-threshold region can be worse by a factor of two compared to transistors working in the super-threshold region [3]. While $V_T$ mismatch is usually regarded as harmful, an interesting observation is that the $V_T$ mismatch between parallel transistors can be exploited to increase the sub/near-threshold current drivability, as shown in Fig. 8.1.3. Wide sub-threshold power switches, such as $S_5$ and $S_6$ in the $V_T$ balancer, are therefore preferably divided into narrow transistors. This is realized with a multiple-finger transistor structure. As a result, the drivability of the power switches is significantly increased without increasing layout area.
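One plausible way to see why mismatch between parallel fingers can raise the average drive current is that the sub-threshold current is an exponential, hence convex, function of $V_T$. The sketch below is not the paper's analysis: it assumes a Gaussian $V_T$ offset per finger, area scaling of the mismatch (finger sigma grows with the square root of the finger count), and illustrative parameter values.

```python
import math
import random

# Assumed slope factor times thermal voltage, in volts (n = 1.5, ~26 mV).
N_PHI_T = 0.039


def finger_sigma(sigma_wide, n_fingers):
    """V_T sigma of one finger, assuming mismatch scales as 1/sqrt(area)."""
    return sigma_wide * math.sqrt(n_fingers)


def expected_drive_gain(sigma_wide, n_fingers):
    """Mean total current relative to the nominal (mismatch-free) current.

    Each finger current is lognormal, so its mean exceeds the nominal
    value by exp(s^2 / 2), with s the sigma expressed in units of n*phi_t.
    """
    s = finger_sigma(sigma_wide, n_fingers) / N_PHI_T
    return math.exp(0.5 * s * s)


def simulated_drive_gain(sigma_wide, n_fingers, trials=5000, seed=1):
    """Monte-Carlo estimate of the same quantity, for cross-checking."""
    rng = random.Random(seed)
    sigma_f = finger_sigma(sigma_wide, n_fingers)
    total = 0.0
    for _ in range(trials):
        currents = [
            math.exp(-rng.gauss(0.0, sigma_f) / N_PHI_T) for _ in range(n_fingers)
        ]
        total += sum(currents) / n_fingers  # each finger carries 1/n of nominal
    return total / trials


if __name__ == "__main__":
    for fingers in (1, 2, 4, 8):
        print(fingers, expected_drive_gain(0.005, fingers),
              simulated_drive_gain(0.005, fingers))
```

In this simplified model the mean drive grows with the number of fingers at constant total width, which is consistent in direction with the observation reported for Fig. 8.1.3. The magnitude depends entirely on the assumed parameters.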

A sub-threshold standard cell library has been developed to synthesize the engines. Compared to existing super-threshold libraries, the standard cells in the sub-threshold library are resized so as to pass Monte-Carlo simulation with high confidence. In addition, certain circuit structures are strictly prohibited, as illustrated in Fig. 8.1.4: cells with more than three parallel transistors or more than three stacked transistors are avoided (only PMOS transistors are drawn for clarity). Ratioed logic cells are also avoided.
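The structural rules above lend themselves to a simple library filter. The following sketch is hypothetical: the `Cell` fields and the example cell names are invented for illustration and do not describe the actual library or any EDA tool's data model.

```python
from dataclasses import dataclass

MAX_PARALLEL = 3  # at most three parallel transistors
MAX_STACK = 3     # at most three stacked (series) transistors


@dataclass(frozen=True)
class Cell:
    name: str
    max_parallel: int  # largest group of parallel transistors in the cell
    max_stack: int     # tallest series stack in the cell
    ratioed: bool      # True for ratioed logic (e.g. pseudo-NMOS)


def violations(cell: Cell) -> list:
    """Return the list of rules a cell breaks (empty if it is allowed)."""
    problems = []
    if cell.max_parallel > MAX_PARALLEL:
        problems.append("too many parallel transistors")
    if cell.max_stack > MAX_STACK:
        problems.append("too many stacked transistors")
    if cell.ratioed:
        problems.append("ratioed logic")
    return problems


def allowed(cells) -> list:
    """Names of the cells that satisfy every rule."""
    return [cell.name for cell in cells if not violations(cell)]


if __name__ == "__main__":
    library = [
        Cell("NAND2", max_parallel=2, max_stack=2, ratioed=False),
        Cell("NAND3", max_parallel=3, max_stack=3, ratioed=False),
        Cell("NAND4", max_parallel=4, max_stack=4, ratioed=False),
        Cell("PSEUDO_NMOS_INV", max_parallel=1, max_stack=1, ratioed=True),
    ]
    print(allowed(library))
```

Cells with exactly three parallel or stacked devices pass, since the text excludes only those with more than three.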

Also shown in Fig. 8.1.4 is the 2-stage level-shifting scheme used in SubJPEG. The 1st-stage level shifting is performed by simple buffers, which are capable of pulling signals up from the sub-threshold $V_{DDL}$ to $V_{DDH}$ (~$V_{DDL}$+300mV). The 2nd-stage level shifting, from $V_{DDH}$ to the 1.2V I/O pads, is performed by feedback-structured level shifters.

The chip is fabricated in a 65nm 7-layer standard-$V_T$ CMOS process. A micrograph of the chip is shown in Fig. 8.1.7. The core area is 1.4×1.4mm<sup>2</sup> without pads. Fig. 8.1.5 shows the waveforms of some control signals from the logic analyzer. Also shown is the measurement result for the $V_T$ balancer. Measurements of energy and speed performance are summarized in Fig. 8.1.6.

#### Acknowledgements:

The authors thank Leo Sevat and Maurice Meijer for their support during the back-end and testing of the chip.

#### References:

[1] Gregory K. Wallace, "The JPEG Still Picture Compression Standard," *IEEE Trans. Consumer Electronics*, vol. 38, no. 1, pp. xviii–xxxiv, Feb. 1992.

[2] Yu Pu, José Pineda de Gyvez, Henk Corporaal and Yajun Ha, "V<sub>T</sub> Balancing and Device Sizing Towards High Yield of Sub-threshold Static Logic Gates," *IEEE International Symp. Low Power Electronics and Design (ISLPED)*, pp. 355–358, Aug. 2007.

[3] José Pineda de Gyvez and Hans P. Tuinhout, "Threshold Voltage Mismatch and Intra-Die Leakage Current in Digital CMOS Circuits," *IEEE J. Solid-State Circuits*, vol. 39, no. 1, pp. 157–168, Jan. 2004.

146 • 2009 IEEE International Solid-State Circuits Conference

978-1-4244-3457-2/09/\$25.00 ©2009 IEEE



# ISSCC 2009 / February 10, 2009 / 8:30 AM





# **ISSCC 2009 PAPER CONTINUATIONS**

Figure 8.1.7: Die micrograph and core layout of SubJPEG test chip in 65nm CMOS.
