An Ultra-Low-Energy Multi-Standard JPEG Co-Processor in 65 nm CMOS With Sub/Near Threshold Supply Voltage

doi:10.1109/JSSC.2009.2039684

668 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 45, NO. 3, MARCH 2010


Yu Pu, Member, IEEE, Jose Pineda de Gyvez, Fellow, IEEE, Henk Corporaal, Member, IEEE, and Yajun Ha, Senior Member, IEEE

Abstract—We present a design technique for (near) subthreshold

operation that achieves ultra low energy dissipation at throughputs of up to 100 MB/s, suitable for digital consumer electronic applications. Our approach employs i) architecture-level parallelism to compensate for throughput degradation, ii) a configurable $V_T$ balancer to mitigate the $V_T$ mismatch of nMOS and pMOS transistors operating in sub/near threshold, and iii) a fingered-structured parallel transistor that exploits $V_T$ mismatch to improve current drivability. Additionally, we describe the selection procedure of the standard cells and how they were modified for higher reliability in the subthreshold regime. All these concepts are demonstrated using SubJPEG, a $1.4 \times 1.4$ mm$^2$ 65 nm CMOS standard-$V_T$ multi-standard JPEG co-processor. Measurement results of the discrete cosine transform (DCT) and quantization processing engines, operating in the subthreshold regime, show an energy dissipation of only 0.75 pJ per cycle with a supply voltage of 0.4 V at 2.5 MHz. This leads to an $8.3\times$ energy reduction when compared to using a 1.2 V nominal supply. In the near-threshold regime, the energy dissipation is 1.0 pJ per cycle with a 0.45 V supply voltage at 4.5 MHz. The system throughput can meet the 15 fps $640 \times 480$ pixel VGA standard. Our methodology is largely applicable to designing other sound/graphic and streaming processors.

Index Terms—JPEG, parallel architecture, sub-threshold, ultra

low energy.

I. INTRODUCTION

WITH the ever-shrinking feature size, the number of transistors integrated in one digital core doubles approximately every two years. The increasing transistor density greatly challenges the limited battery life and thermal properties of the IC. Exploring a design methodology for ultra low-energy, “green” digital circuits is thus very important. One of the most effective means to achieve these goals is to scale the supply voltage $V_{DD}$ along with the operating frequency. As $V_{DD}$ scales, not only does the dynamic energy reduce quadratically, but the leakage current also reduces super-linearly due to the drain-induced barrier-lowering (DIBL) effect. Therefore, the total energy dissipation of a circuit can be reduced considerably. In addition, $V_{DD}$ scaling reduces transient current spikes, hence lowering the notorious ground bounce noise. This also helps to improve the performance of sensitive analog circuits on the chip, such as delay-locked loops (DLL), which are crucial for the functioning of large digital circuits.

Manuscript received June 24, 2009; revised September 09, 2009. Current version published February 24, 2010. This paper was approved by Associate Editor Bevan Baas.

Y. Pu was with the Ultra Low Power DSP Processor Group, IMEC-NL, 5656 AE Eindhoven, The Netherlands, and is now with the Sakurai Lab, University of Tokyo, Tokyo 153-8505, Japan (e-mail: y.pu@tue.nl).

J. Pineda de Gyvez is with NXP Semiconductors, 5656 AE Eindhoven, The Netherlands.

H. Corporaal is with the Faculty of Electrical Engineering, Eindhoven University of Technology, 5612 AZ Eindhoven, The Netherlands.

Y. Ha is with the Department of Electrical and Computer Engineering, National University of Singapore, 117576 Singapore.

Digital Object Identifier 10.1109/JSSC.2009.2039684

In contrast to analog circuit design, where lowering $V_{DD}$ to the subthreshold region is generally avoided because of the small driving currents and the exceedingly large noise, CMOS digital logic gates can work seamlessly from full $V_{DD}$ to well below the threshold voltage $V_T$. Theoretically, operating digital circuits in the near/sub-threshold region can yield huge energy savings. However, the design rules provided by foundries normally set 2/3 of the full $V_{DD}$ as the practical limit for $V_{DD}$ scaling. Taking Samsung’s DVFS Design Technology [1] and TSMC’s design rules as examples, the constraint on $V_{DD}$ for digital circuits designed in the CMOS 65 nm Standard $V_T$ Process is in the 1.2 V [...] range. The reasoning behind the limitation is twofold. First, as $V_{DD}$ scales, the driving capability of transistors reduces accordingly. Most consumer electronic applications need operating frequencies in the range of tens of MHz to reach a certain throughput, which might not be achievable with aggressive $V_{DD}$ scaling. Second, digital circuits become particularly sensitive to process variations when $V_{DD}$ scales below 2/3 of the full $V_{DD}$. Process variations are likely to cause malfunctioning, and both the timing yield and the functional yield may decrease tremendously. As a result, $V_{DD}$ is generally chosen to maintain an adequate margin to prevent high yield loss and to keep quality according to industrial standards. The goal of our work is to safely evade this limitation so as to enable wide-range voltage scaling, from nominal supply to near/sub threshold.
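As a rough illustration of the quadratic dependence of dynamic energy on supply voltage discussed above, the following sketch compares an idealized $CV^2$ scaling with the operating points quoted in the abstract. The function names and the pure $E \propto V_{DD}^2$ model are assumptions made for illustration; leakage and short-circuit energy are ignored.

```python
def dynamic_energy_ratio(v_nominal: float, v_scaled: float) -> float:
    """Idealized ratio E(v_nominal) / E(v_scaled), assuming E = C * V^2."""
    return (v_nominal / v_scaled) ** 2


def average_power_watts(energy_per_cycle_j: float, clock_hz: float) -> float:
    """Average power when every clock cycle costs energy_per_cycle_j joules."""
    return energy_per_cycle_j * clock_hz


# Operating points quoted in the abstract (DCT and quantization engines).
ideal_ratio = dynamic_energy_ratio(1.2, 0.4)
sub_vt_power = average_power_watts(0.75e-12, 2.5e6)
near_vt_power = average_power_watts(1.0e-12, 4.5e6)

print(f"ideal CV^2 reduction from 1.2 V to 0.4 V: {ideal_ratio:.1f}x")
print(f"average power at 0.4 V, 2.5 MHz: {sub_vt_power * 1e6:.3f} uW")
print(f"average power at 0.45 V, 4.5 MHz: {near_vt_power * 1e6:.3f} uW")
```

Under this idealized model the reduction from 1.2 V to 0.4 V is a factor of 9, which is of the same order as the measured $8.3\times$ reported in the abstract.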

Sub/near threshold techniques have been explored in recent years. Fig. 1 shows a comparison of the computation efficiency (GOPS/W) and throughput (MOPS) of our SubJPEG co-processor and other existing subthreshold processors. Likewise, Table I summarizes the most relevant work in the field. In contrast to the work presented in those publications, our work has some unique features. Firstly, we explore the use of architecture-level parallelism to compensate for throughput degradation at ultra-low supply values. Parallelism along with sub/near threshold techniques is best suited for low-energy and medium-frequency applications, such as mobile image processing. Secondly, this work proposes a configurable $V_T$ balancer to lessen the mismatch between nMOS and pMOS transistors, such that both the functional and the timing yield are increased. Thirdly, we make use of design approaches that exploit parallel-transistor $V_T$ mismatch to improve drivability in power switches, and of design strategies that select a reliable cell library for logic synthesis and that turn ratioed logic into non-ratioed logic to improve the robustness of our design in the subthreshold regime. To demonstrate these ideas, we have designed and implemented a 65 nm CMOS ultra-low energy multi-standard JPEG co-processor.

Fig. 1. Computation efficiency and throughput of this work and other works.

TABLE I

SUMMARY OF EXISTING SUB-THRESHOLD WORK
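To make the throughput argument concrete, the sketch below computes the pixel rate implied by the VGA target quoted in the abstract and relates it to the near-threshold clock frequency. Counting one sample per pixel (ignoring chroma components and any processing overhead) is an assumption made purely for illustration, and the function names are hypothetical.

```python
def pixel_rate(width: int, height: int, fps: int) -> int:
    """Pixels that must be processed per second for a given video format."""
    return width * height * fps


def pixels_per_cycle(rate_px_per_s: int, clock_hz: float) -> float:
    """Average number of pixels that must be consumed in each clock cycle."""
    return rate_px_per_s / clock_hz


vga_rate = pixel_rate(640, 480, 15)            # 15 fps VGA target
needed = pixels_per_cycle(vga_rate, 4.5e6)     # near-threshold clock, 4.5 MHz

print(f"VGA at 15 fps: {vga_rate} pixels/s")
print(f"pixels per cycle needed at 4.5 MHz: {needed:.3f}")
```

Under these assumptions slightly more than one pixel has to be consumed per cycle at 4.5 MHz, so a datapath that handles at most one pixel per cycle would fall short. This is the kind of gap that architecture-level parallelism is meant to close.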

The remainder of this paper is organized as follows. Section II presents the physical-level effort we have made for an enhanced circuit yield. In Section III, the architecture of SubJPEG is introduced in detail. Section IV presents key design issues and the evaluation results of the prototype chip. Finally, Section V draws the conclusions of this work.

II. PHYSICAL LEVEL EFFORT FOR AN ENHANCED YIELD

A. Configurable $V_T$ Balancer

$V_T$ mismatch dominates the subthreshold current variation due to its exponential correlation to the current. Since the $V_T$ of each transistor type is controlled by an independent doping process, the pMOS and nMOS $V_T$ can vary significantly with respect to each other. Consequently, this variability can result in lower circuit yield. For example, at the fast nMOS slow pMOS corner (FNSP), where the nMOS network is much leakier than the pMOS network, a sufficiently high output voltage may not be reached. Similarly, a sufficiently low output voltage may not be reached at the fast pMOS slow nMOS corner (SNFP). Even if the noise margin can be met, either the rise or the fall time becomes exceedingly long at process corners, which also dramatically deteriorates the timing yield. Therefore, it is very important to balance the $V_T$ of pMOS and nMOS transistors.

We propose a configurable $V_T$ balancing scheme (Fig. 2), which enables ultra-wide-range $V_{DD}$ scaling from the nominal supply voltage to sub-threshold. This configurable $V_T$ balancer is an extension of our previous work [20]. Our $V_T$ balancer also differs from the regulator presented in [21]: it uses an imbalance detector with better sensitivity, it uses an amplifier in the feedback loop to enhance the sensitivity further, and it is configurable to support a wide tuning range.

Let us now address the operation of our $V_T$ balancer. When the processor works in the super-threshold mode, the enable signal of the tri-state buffer is off, so that the buffer is in a high-impedance state. Since the power switch transistors that tie the bulks to the supply rails are on and those at the buffer output are off, the bulk of the pMOS transistors is connected to $V_{DD}$ and the bulk of the nMOS transistors is connected to ground. When the processor is configured to work in the subthreshold mode, the enable signal is on, and thus the tri-state buffer is functional. In this mode, the switches at the buffer output are on and the rail switches are off. Therefore, the buffer’s output voltage passes through the on switches to feed the bulk of the logic gates. A CMOS inverter, whose pMOS and nMOS transistors are both off, functions as a process-corner $V_T$ imbalance detector.

Fig. 2. Proposed configurable $V_T$ balancer.

detector. Observe that

is never higher than pre-

venting in this way the junction diodes from turning on in the

P-well and N-well under control.

and are designed

in advance to be at

in the typical process corner (TT).

ﬂuctuates with the variations of process and temperature.

The buffer detects and amplifies the swing of the detector output. The buffer’s output, which provides the bulk voltage for the logic gates, is also fed back to the bulk of the threshold balancing detector to force the pMOS/nMOS $V_T$ balancing. For instance, if the nMOS is leakier than the pMOS, the detector output will decrease, triggering a much larger drop at the buffer output. This drop will make the nMOS increase its $V_T$ and the pMOS decrease its $V_T$, such that the process-corner $V_T$ imbalance is mitigated. In our design, the power switch transistors are nMOS transistors overdriven by a boosted gate voltage. Hence, a potential drop across the switch transistors is avoided. The boosted gate voltage can be obtained either from other high voltage domains or from the periphery I/O power rails.
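The feedback action described above can be illustrated with a toy numerical model. Everything in the sketch below is an assumption made for illustration and is not taken from the measured design: the exponential subthreshold leakage model without DIBL, the linearized body effect `K_BODY`, the buffer gain `GAIN`, the single shared bulk voltage, and the example threshold voltages. The code solves for the detector output at which the two leakage currents are equal, and then for the bulk voltage at which the loop settles.

```python
import math

PHI_T = 0.026    # thermal voltage in volts (assumed, about room temperature)
N_SLOPE = 1.5    # subthreshold slope factor (assumed)
K_BODY = 0.2     # V_T reduction per volt of forward body bias (assumed, linearized)
GAIN = 5.0       # buffer gain around VDD/2 (assumed)


def detector_output(vtn: float, vtp: float, vdd: float) -> float:
    """Output of an inverter with both transistors off: the voltage at which
    the nMOS and pMOS subthreshold leakage currents are equal (bisection)."""
    def net_pull_down(vx: float) -> float:
        i_n = math.exp(-vtn / (N_SLOPE * PHI_T)) * (1 - math.exp(-vx / PHI_T))
        i_p = math.exp(-vtp / (N_SLOPE * PHI_T)) * (1 - math.exp(-(vdd - vx) / PHI_T))
        return i_n - i_p  # increases monotonically with vx

    lo, hi = 0.0, vdd
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if net_pull_down(mid) < 0:
            lo = mid   # pull-up still dominates, so the output sits higher
        else:
            hi = mid
    return 0.5 * (lo + hi)


def biased_thresholds(vtn0: float, vtp0: float, v_bulk: float, vdd: float):
    """Forward body bias lowers V_T: the nMOS sees v_bulk, the pMOS vdd - v_bulk."""
    return vtn0 - K_BODY * v_bulk, vtp0 - K_BODY * (vdd - v_bulk)


def balanced_bulk_voltage(vtn0: float, vtp0: float, vdd: float) -> float:
    """Bulk voltage at which the buffer output equals the bulk voltage it drives."""
    lo, hi = 0.0, vdd
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        vtn, vtp = biased_thresholds(vtn0, vtp0, mid, vdd)
        vx = detector_output(vtn, vtp, vdd)
        buffer_out = min(vdd, max(0.0, vdd / 2 + GAIN * (vx - vdd / 2)))
        if buffer_out > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


VDD = 0.4
VTN0, VTP0 = 0.28, 0.30   # hypothetical corner: nMOS leakier than pMOS
v_bulk = balanced_bulk_voltage(VTN0, VTP0, VDD)
vtn, vtp = biased_thresholds(VTN0, VTP0, v_bulk, VDD)
imbalance_before = abs(VTN0 - VTP0)
imbalance_after = abs(vtn - vtp)

print(f"bulk voltage: {v_bulk * 1e3:.1f} mV")
print(f"V_T imbalance: {imbalance_before * 1e3:.1f} mV -> {imbalance_after * 1e3:.3f} mV")
```

With these assumed numbers the loop settles at a bulk voltage below $V_{DD}/2$, which raises the nMOS threshold relative to the pMOS threshold and shrinks the initial imbalance, in line with the qualitative description above. Both devices remain forward-biased at the settled point.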

We use a metric to represent the $V_T$ imbalance. In fact, it depicts how far the monitored voltage deviates from its balanced value due to unbalanced devices. The larger the metric is, the larger the $V_T$ imbalance is. Fig. 3(a) shows the simulated range of this metric, with and without our balancing scheme. As can be seen, the imbalance between the $V_T$ of pMOS and nMOS transistors is confined to a much tighter range after $V_T$ balancing. Fig. 3(b) shows the Monte Carlo simulated propagation delay for an inverter with an aspect ratio of $W_p/W_n = 1.1\,\mu\mathrm{m}/0.40\,\mu\mathrm{m}$ driving a capacitive load of 5 fF at [...] mV in the CMOS 65 nm process. After balancing, the average propagation delay of the inverter is reduced from 14 ns to 10 ns. This speed improvement occurs because both the pMOS and nMOS transistors are forward-biased when the balancer is turned on. Most importantly,

the standard deviation

is reduced by and the is re-

duced by

when the proposed conﬁgurable balancer is

used, as an exceedingly long rising/falling time is avoided.

B. Improving Driving Capability by Exploiting Parallel $V_T$ Mismatch in Power Switches

Even though $V_T$ mismatch is known to be catastrophic for circuit functionality, we have developed an approach that improves sub/near threshold current drivability by exploiting the $V_T$ mismatch between parallel transistors. Our approach is based on a theoretical proof and simulation results showing that, in the subthreshold regime, the $V_T$ mismatch between parallelized transistors always results in an increased mean driving current. This property has been applied to the power switches of the $V_T$ balancer circuit.

Suppose $\mu_{V_T}$ and $\sigma_{V_T}$ are the mean and standard deviation of the $V_T$ of an nMOS transistor as shown in Fig. 4(a). Considering the intra-die variation of a single transistor modeled as in [22], we have

$$\sigma_{V_T} = \frac{A_{V_T}}{\sqrt{WL}} \qquad (1)$$

where $A_{V_T}$ is a technology conversion constant (in mV$\cdot\mu$m), and $WL$ is the transistor’s active area. Since $V_T$ follows a normal distribution, the transistor’s on-current follows a log-normal distribution in sub-threshold. Using the properties of a log-normal distribution, the mean value and standard deviation of the on-current are as shown in (2) and (3), where $V_{GS}$ is the gate-source voltage and the remaining parameters are the intrinsic thermal voltage and the junction gradient coefficient.

(2)

(3)

Suppose the transistor is equally divided into $N$ parallel nMOS transistors [see Fig. 4(b)]. Without loss of generality, let us denote the mean and standard deviation of the threshold voltage of any of these parallel transistors as

(4)

(5)

where

(6)

Then, the mean value of the total subthreshold current in Fig. 4(b) is obtained as shown in (7) and (8).

(7)

(8)

Comparing (1) and (6), and since $N > 1$, we have that

(9)

Then, by comparing (2) and (7), we obtain

(10)

Fig. 3. (a) Simulated $3\sigma$ range of the imbalance metric. (b) Propagation delay for an inverter in 65 nm CMOS from Monte Carlo simulation ($W_p/W_n = 1.1\,\mu\mathrm{m}/0.40\,\mu\mathrm{m}$, $C = 5$ fF).

Fig. 4. (a) nMOS transistor with aspect ratio (W, L); (b) N-parallelized nMOS transistors with aspect ratio (W/N, L).
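The argument can be sketched as follows under a textbook subthreshold model. The current model $I = I_0 \frac{W}{L} e^{(V_{GS}-V_T)/(n\phi_t)}$, the symbols $I_0$, $n$, $\phi_t$, $\mu_{V_T}$, $\sigma_{V_T}$ and $A_{V_T}$, and the omission of narrow-width effects on the mean threshold voltage are assumptions made here; the expressions are not claimed to match the numbered equations term by term.

```latex
\begin{align*}
\sigma_{V_T} &= \frac{A_{V_T}}{\sqrt{WL}}, \qquad
I = I_0 \frac{W}{L} \exp\!\left(\frac{V_{GS}-V_T}{n\phi_t}\right), \qquad
V_T \sim \mathcal{N}\!\left(\mu_{V_T}, \sigma_{V_T}^2\right) \\
\mathrm{E}[I] &= I_0 \frac{W}{L}
  \exp\!\left(\frac{V_{GS}-\mu_{V_T}}{n\phi_t}\right)
  \exp\!\left(\frac{\sigma_{V_T}^2}{2 n^2 \phi_t^2}\right) \\
\sigma_{V_T,i} &= \frac{A_{V_T}}{\sqrt{(W/N)\,L}} = \sqrt{N}\,\sigma_{V_T} > \sigma_{V_T}
  \quad (N > 1) \\
\mathrm{E}[I_{\mathrm{total}}] &= \sum_{i=1}^{N} I_0 \frac{W}{NL}
  \exp\!\left(\frac{V_{GS}-\mu_{V_T}}{n\phi_t}\right)
  \exp\!\left(\frac{N\sigma_{V_T}^2}{2 n^2 \phi_t^2}\right)
  = \mathrm{E}[I]\,\exp\!\left(\frac{(N-1)\,\sigma_{V_T}^2}{2 n^2 \phi_t^2}\right)
  > \mathrm{E}[I]
\end{align*}
```

The second line uses the mean of a log-normal variable; the last line shows that, in this simplified model, splitting the device into $N > 1$ fingers multiplies the mean current by a factor that grows with $N$.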

As can be seen, dividing a large transistor into smaller parallelized transistors helps to increase the subthreshold current due to the larger $V_T$ mismatch. We also ran Monte Carlo simulations to confirm the effectiveness of this approach. As a reference, assume an nMOS transistor of a given aspect ratio, divided into $N$ parallel transistors, with its gate voltage and drain-to-source voltage set at 200 mV. The reason for choosing 200 mV is that, in the $V_T$ balancer, the gate and drain-to-source voltages of the power switches operating in the subthreshold regime are approximately 200 mV (half of the 400 mV supply). Since the power switches’ output will forward bias the bulk of the p/n transistors in the digital blocks, an output voltage close to 200 mV is the right magnitude to bring a deviated $V_T$ imbalance back to its typical value without incurring too much excess leakage current. The simulated mean and standard deviation values of the effective driving current are listed in Table II. As seen, the larger the number of segments $N$, the larger the $V_T$ mismatch, and consequently the larger the mean subthreshold driving current. However, Table II also shows an increasing driving current variability as the transistor becomes narrower. According to (8), this is due to an increased $V_T$ shift caused by narrow-width effects. To mitigate this effect, instead of dividing all transistors into minimal-width transistors, our design constrains the transistor width to be no smaller than a certain limit. By constraining the variability to a maximum of 20%, the same driving current can be achieved with approximately 10% less transistor area. In addition, the multi-finger layout avoids an extreme aspect ratio and fits easily into the layout of the other devices, hence making the entire layout more compact.
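The trend described above (a larger mean driving current with more segments) can be reproduced with a small Monte Carlo experiment. The sketch below is not the simulation setup behind Table II: the threshold statistics, the slope factor, and the pure Pelgrom-style scaling of sigma with finger area are assumed values, and the narrow-width $V_T$ shift is deliberately left out.

```python
import math
import random

PHI_T = 0.026      # thermal voltage in volts (assumed)
N_SLOPE = 1.5      # subthreshold slope factor (assumed)
MU_VT = 0.30       # mean threshold voltage in volts (hypothetical)
SIGMA_VT = 0.010   # V_T sigma of the undivided transistor in volts (hypothetical)


def mean_total_current(n_fingers: int, trials: int, rng: random.Random) -> float:
    """Monte Carlo mean of the summed finger currents, normalized so that a
    mismatch-free transistor would give exactly 1.0."""
    sigma_finger = SIGMA_VT * math.sqrt(n_fingers)  # sigma grows as area shrinks
    scale = N_SLOPE * PHI_T
    total = 0.0
    for _ in range(trials):
        for _ in range(n_fingers):
            vt = rng.gauss(MU_VT, sigma_finger)
            # Each finger has width W/N, hence the 1/n_fingers weight.
            total += math.exp(-(vt - MU_VT) / scale) / n_fingers
    return total / trials


def lognormal_mean(n_fingers: int) -> float:
    """Closed-form mean of the same normalized current (log-normal mean)."""
    sigma_finger = SIGMA_VT * math.sqrt(n_fingers)
    return math.exp(sigma_finger ** 2 / (2 * (N_SLOPE * PHI_T) ** 2))


rng = random.Random(0)  # fixed seed so the run is repeatable
results = {n: mean_total_current(n, 20000, rng) for n in (1, 2, 4, 8)}
for n, value in results.items():
    print(f"N={n}: simulated {value:.3f}, closed form {lognormal_mean(n):.3f}")
```

With these assumed parameters the closed form predicts a mean current roughly 3% above the mismatch-free value for $N = 1$ and roughly 30% above it for $N = 8$. The growing spread of the per-finger currents is what the width constraint discussed above is meant to keep in check.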

TABLE II

MEAN AND STANDARD DEVIATION OF DRIVING CURRENT

C. Sub-Threshold Library Selection

The standard library cells optimized for super-threshold design must be revised for reliable logic synthesis. Cells with a large effective driving current variability will have a remarkably low yield. We identified these cells through Monte Carlo simulations and filtered them out before logic synthesis. The metric we used is that, after applying $V_T$ balancing,

the cells that have

20% at

400 mV, are eliminated, where is the leakage

current for off-transistors. These cells have some typical struc-

tures:

1) More Than Four Parallel Transistors and More Than Four

Stacked Transistors: The standard cells are composed of narrow

transistors to increase area efﬁciency. As the number of parallel

transistors and the number of stacked-transistors increases, the

leakage current variability increases dramatically, as shown in

Section II-B. We simply discarded logic gates with more than

four parallel transistors or more than four stacked transistors,

such as 4-input NAND and NOR gates.

2) Ratioed Logic: Ratioed logic can reduce the number of transistors required to implement a given logic function, but it must be sized carefully to guarantee that the active current is stronger than the static current. Therefore, the correct functioning of ratioed logic cells depends largely on the sizing. In the subthreshold region, the largest current variability is due to $V_T$ variation. Even a small variation in $V_T$ has a heavy impact on the active or static current. Therefore, logic cells relying entirely on transistor sizing are dangerous and should be avoided.

3) Feedback Logic: Feedback logic is a special type of ratioed logic which uses positive feedback loops to help change the logic values. Due to $V_T$ variation, the output of the logic can have stuck-high or stuck-low failures and thus never flip.
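The selection rules above can be expressed as a simple filter over a cell library. The sketch below is only an illustration: the `Cell` record, the cell names, the sample values, and in particular the use of the coefficient of variation of the driving current as the 20% metric are assumptions, not the exact criterion used for the chip.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Cell:
    name: str
    max_parallel: int        # largest group of parallel transistors
    max_stacked: int         # largest series stack
    ratioed: bool            # relies on sizing ratios (includes feedback logic)
    current_samples: list    # Monte Carlo driving-current samples (arbitrary units)


def is_reliable(cell: Cell, limit: int = 4, max_cv: float = 0.20) -> bool:
    """Keep a cell only if it passes the structural and variability checks."""
    if cell.max_parallel > limit or cell.max_stacked > limit:
        return False  # too many parallel or stacked transistors
    if cell.ratioed:
        return False  # correct operation depends on sizing ratios
    cv = pstdev(cell.current_samples) / mean(cell.current_samples)
    return cv <= max_cv  # assumed metric: relative current spread


# Hypothetical library entries with made-up Monte Carlo samples.
library = [
    Cell("CELL_A", 2, 2, False, [1.00, 1.05, 0.95, 1.02, 0.98]),
    Cell("CELL_B", 5, 1, False, [1.00, 1.01, 0.99, 1.00, 1.00]),
    Cell("CELL_C", 2, 2, True, [1.00, 1.01, 0.99, 1.00, 1.00]),
    Cell("CELL_D", 2, 2, False, [1.0, 1.6, 0.5, 1.3, 0.6]),
]
kept = [cell.name for cell in library if is_reliable(cell)]
print(kept)
```

In this made-up library only `CELL_A` survives: `CELL_B` fails the structural limit, `CELL_C` is ratioed, and `CELL_D` has too large a current spread.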

D. Turning Ratioed Logic Into Non-Ratioed Logic

Latches and registers are feedback logic that must be used in sequential circuits. To reduce the loading on the clock net and ease ultrahigh-speed designs, some latches/registers use weak but always-on feedback inverters. Fig. 5 shows how to turn them into non-ratioed logic. By using the clk signal and its complement, we prevent the slave inverters from directly cross-coupling with the master inverters. As a result, when writing into the latch, the slave inverter is always disabled, so the writing to the master inverter is facilitated. After the writing is done, the slave inverter is enabled to help maintain the logic value. Therefore, the race


References

The JPEG still picture compression standard


Matching properties of MOS transistors


A 180-mV subthreshold FFT processor using a minimum energy design methodology
