An Ultra-Low-Energy Multi-Standard JPEG Co-Processor in 65 nm CMOS With Sub/Near Threshold Supply Voltage

Abstract
We present a design technique for (near) subthreshold operation that achieves ultra low energy dissipation at throughputs of up to 100 MB/s suitable for digital consumer electronic applications. Our approach employs i) architecture-level parallelism to compensate for throughput degradation, ii) a configurable $V_T$ balancer to mitigate the $V_T$ mismatch of nMOS and pMOS transistors operating in sub/near threshold, and iii) a fingered-structured parallel transistor that exploits $V_T$ mismatch to improve current drivability. Additionally, we describe the selection procedure of the standard cells and how they were modified for higher reliability in the subthreshold regime. All these concepts are demonstrated using SubJPEG, a 1.4 × 1.4 mm² 65 nm CMOS standard-$V_T$ multi-standard JPEG co-processor. Measurement results of the discrete cosine transform (DCT) and quantization processing engines, operating in the subthreshold regime, show an energy dissipation of only 0.75 pJ per cycle with a supply voltage of 0.4 V at 2.5 MHz. This leads to an 8.3× energy reduction when compared to using a 1.2 V nominal supply. In the near-threshold regime the energy dissipation is 1.0 pJ per cycle with a 0.45 V supply voltage at 4.5 MHz. The system throughput can meet the 15 fps 640 × 480 pixel VGA standard. Our methodology is largely applicable to designing other sound/graphic and streaming processors.

668 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 45, NO. 3, MARCH 2010
Yu Pu, Member, IEEE, Jose Pineda de Gyvez, Fellow, IEEE, Henk Corporaal, Member, IEEE, and
Yajun Ha, Senior Member, IEEE
Index Terms—JPEG, parallel architecture, sub-threshold, ultra
low energy.
Manuscript received June 24, 2009; revised September 09, 2009. Current version published February 24, 2010. This paper was approved by Associate Editor Bevan Baas.
Y. Pu was with the Ultra Low Power DSP Processor Group, IMEC-NL, 5656 AE Eindhoven, The Netherlands, and is now with the Sakurai Lab, University of Tokyo, Tokyo 153-8505, Japan (e-mail: y.pu@tue.nl).
J. Pineda de Gyvez is with NXP Semiconductors, 5656 AE Eindhoven, The Netherlands.
H. Corporaal is with the Faculty of Electrical Engineering, Eindhoven University of Technology, 5612 AZ Eindhoven, The Netherlands.
Y. Ha is with the Department of Electrical and Computer Engineering, National University of Singapore, 117576 Singapore.
Digital Object Identifier 10.1109/JSSC.2009.2039684

I. INTRODUCTION
WITH the ever-shrinking feature size, the number of transistors integrated in one digital core doubles approximately every two years. The increasing transistor density greatly challenges the limited battery life and thermal properties of the IC. Exploring a design methodology for ultra low-energy, “green” digital circuits is thus very important. One of the most effective means to achieve these goals is to scale the supply voltage $V_{DD}$ along with the operating frequency. As $V_{DD}$ scales, not only does the dynamic energy reduce quadratically, but the leakage current also reduces super-linearly due to the drain-induced barrier-lowering (DIBL) effect. Therefore, the total energy dissipation of a circuit can be reduced considerably. In addition, $V_{DD}$ scaling reduces transient current spikes, hence lowering the notorious ground bounce noise. This also helps to improve the performance of sensitive analog circuits on the chip, such as delay-lock loops (DLL), which are crucial for the functioning of large digital circuits.
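As a rough illustration of the quadratic dynamic-energy argument above, the following sketch evaluates the first-order relation E = C·V_DD² at the two supply points quoted in the abstract (1.2 V and 0.4 V). The capacitance `C_EFF` is a placeholder assumed purely for illustration, and leakage energy is ignored, so this is not a model of the measured chip.

```python
# First-order dynamic switching energy per cycle: E = C_eff * V_DD^2.
# C_EFF is a hypothetical effective switched capacitance, not a value from the paper.


def dynamic_energy(c_eff_farads: float, vdd_volts: float) -> float:
    """Return the first-order dynamic energy per cycle in joules."""
    return c_eff_farads * vdd_volts ** 2


C_EFF = 5e-12  # assumed value, for illustration only

e_nominal = dynamic_energy(C_EFF, 1.2)  # nominal supply
e_subvt = dynamic_energy(C_EFF, 0.4)    # subthreshold supply

# The ratio depends only on the voltage ratio, (1.2 / 0.4)^2, not on C_EFF.
ratio = e_nominal / e_subvt
print(f"ideal dynamic-energy ratio, 1.2 V vs 0.4 V: {ratio:.1f}x")
```

The ideal quadratic term alone gives a factor of 9 between these two supplies, which is of the same order as the 8.3× measured reduction reported in the abstract.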
In contrast to analog circuit design, where lowering $V_{DD}$ to the subthreshold region is generally avoided because of the small driving currents and the exceedingly large noise, CMOS digital logic gates can work seamlessly from full $V_{DD}$ to well below the threshold voltage $V_T$. Theoretically, operating digital circuits in the near/sub-threshold region can help obtain huge energy savings. However, the design rules provided by foundries normally set 2/3 of the full $V_{DD}$ as the practical limit for $V_{DD}$ scaling. Taking Samsung’s DVFS Design Technology [1] and TSMC’s design rules as examples, the constraint on $V_{DD}$ for digital circuits designed in the CMOS 65 nm standard-$V_T$ process is in the 1.2 V [...] range. The reasoning behind the limitation is twofold. First, as $V_{DD}$ scales, the driving capability of transistors reduces accordingly. Most consumer electronic applications need operating frequencies in the range of tens of MHz to reach a certain throughput, which might not be achievable with aggressive $V_{DD}$ scaling. Second, digital circuits become particularly sensitive to process variations when $V_{DD}$ scales below 2/3 of the full $V_{DD}$. Process variations are likely to cause malfunctioning, and both the timing yield and the functional yield may decrease tremendously. As a result, $V_{DD}$ is generally chosen to maintain an adequate margin to prevent high yield loss and to keep quality according to industrial standards. The goal of our work is to safely evade this limitation so as to enable wide range voltage scaling, from the nominal supply down to near/sub threshold.
Sub/near threshold techniques have been explored in recent years. Fig. 1 shows a comparison of the computation efficiency (GOPS/W) and throughput (MOPS) of our SubJPEG co-processor and other existing subthreshold processors. Likewise, Table I summarizes the most relevant work in the field. In contrast to the work presented in those publications, our work has some unique features. Firstly, we explore the use of architecture-level parallelism to compensate for throughput degradation at ultra-low supply values. Parallelism along with sub/near threshold techniques is best suited for low-energy and medium frequency applications, such as mobile image processing. Secondly, this work proposes a configurable $V_T$ balancer to lessen the mismatch between nMOS and pMOS transistors, such that both the functional and the timing yield are increased. Thirdly, we make use of design approaches that exploit parallel-transistor $V_T$ mismatch to improve drivability in power switches, of design strategies that select a reliable cell library for logic synthesis, and of strategies that turn ratioed logic into non-ratioed logic to improve the robustness of our design in the subthreshold regime. To demonstrate these ideas, we have designed and implemented a 65 nm standard-$V_T$ CMOS ultra-low energy multi-standard JPEG co-processor.

Fig. 1. Computation efficiency and throughput of this work and other works.

TABLE I
SUMMARY OF EXISTING SUB-THRESHOLD WORK
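To put the throughput target in numbers, the sketch below computes the pixel rate of the 15 fps VGA format named in the abstract and compares it with the 4.5 MHz near-threshold clock also quoted there. It assumes one sample per pixel and is only a back-of-the-envelope check, not the authors' performance model.

```python
# Back-of-the-envelope throughput check using figures quoted in the abstract.
# Assumption: one sample per pixel (chroma handling is ignored here).


def pixel_rate(width: int, height: int, fps: int) -> int:
    """Return the number of pixels that must be processed per second."""
    return width * height * fps


VGA_RATE = pixel_rate(640, 480, 15)   # pixels per second for 15 fps VGA
CLOCK_HZ = 4.5e6                      # near-threshold clock from the abstract

# Average number of pixels that must be consumed per clock cycle.
PIXELS_PER_CYCLE = VGA_RATE / CLOCK_HZ

print(f"VGA pixel rate: {VGA_RATE} pixels/s")
print(f"required pixels per cycle at 4.5 MHz: {PIXELS_PER_CYCLE:.3f}")
```

Under these assumptions the datapath has to absorb slightly more than one pixel per clock cycle on average, which gives a feel for why processing several samples in parallel is attractive when the clock is this slow.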
The remainder of this paper is organized as follows. Section II presents the physical-level effort we have made for an enhanced circuit yield. In Section III, the architecture of SubJPEG is introduced in detail. Section IV presents key design issues and the evaluation results of the prototype chip. Finally, Section V draws the conclusions of this work.
II. PHYSICAL LEVEL EFFORT FOR AN ENHANCED YIELD
A. Configurable $V_T$ Balancer
$V_T$ mismatch dominates the subthreshold current variation because the current depends exponentially on $V_T$. Since transistor $V_T$ is controlled by an independent doping process, the pMOS and nMOS $V_T$ can vary significantly with respect to each other. Consequently, this variability can result in lower circuit yield. For example, at the fast nMOS slow pMOS corner (FNSP), where the nMOS network is much leakier than the pMOS network, a sufficiently high output voltage may not be reached. Similarly, an insufficiently low output voltage can occur at the fast pMOS slow nMOS corner (SNFP). Even if the noise margin can be met, either the rising or the falling time becomes exceedingly long at process corners, which also dramatically deteriorates the timing yield. Therefore, it is very important to balance the $V_T$ of pMOS and nMOS transistors.
We propose a configurable $V_T$ balancing scheme (Fig. 2), which enables ultra wide range $V_{DD}$ scaling from the nominal supply voltage to sub-threshold. This configurable $V_T$ balancer is an extension of our previous work [20]. Our $V_T$ balancer also differs from the regulator presented in [21]: it uses an imbalance detector with better sensitivity, it uses an amplifier in the feedback loop to enhance the sensitivity further, and it is configurable to support wide-range tuning.
Let us now address the operation of our $V_T$ balancer. When the processor works in the super-threshold mode, the buffer enable shown in Fig. 2 is off, so the tri-state buffer is in a high-impedance state. The power switch transistors that tie the bulks to the rails are on and those that pass the buffer output are off, so the bulk of the pMOS transistors is connected to $V_{DD}$ and the bulk of the nMOS transistors is connected to ground. When the processor is configured to work in the subthreshold mode, the buffer enable is on, and thus the tri-state buffer is functional. In this mode the on/off states of the two groups of power switches are swapped. Therefore, the buffer’s output voltage passes through the enabled power switches to feed the bulk of the logic gates. A CMOS inverter, whose pMOS and nMOS transistors are both off, functions as a process-corner $V_T$ imbalance detector. Observe that [...] is never higher than [...], which prevents the junction diodes in the controlled P-well and N-well from turning on. [...] and [...] are designed in advance to be at [...] in the typical process corner (TT). The detector output fluctuates with the variations of process and temperature. The buffer detects and amplifies the swing of the detector output. The buffer’s output, which supplies the bulk voltage for the logic gates, is fed back to the bulk of the threshold balancing detector to force the pMOS/nMOS $V_T$ balancing. For instance, if the nMOS is leakier than the pMOS, the detector output will decrease, triggering a much larger drop at the buffer output. This drop will make the nMOS increase its $V_T$ and the pMOS decrease its $V_T$, such that the process-corner $V_T$ imbalance is mitigated. In our design, the power switch transistors [...] are nMOS transistors overdriven by a boosted gate voltage. Hence, their [...] is small enough to avoid the potential drop across a transistor. The boosted gate voltage can be obtained either from other high voltage domains or from the periphery I/O power rails.

Fig. 2. Proposed configurable $V_T$ balancer.
We use a metric [...] to represent the $V_T$ imbalance. In fact, this metric depicts how far [...] deviates from [...] due to unbalanced devices. The larger the metric is, the larger the $V_T$ imbalance is. Fig. 3(a) shows the simulated range of the metric, with and without our balancing scheme. As can be seen, the imbalance between the $V_T$ of pMOS and nMOS transistors is confined to a much tighter range after $V_T$ balancing. Fig. 3(b) shows the Monte Carlo simulated propagation delay for an inverter with an aspect ratio of 1.1 µm/0.40 µm driving a capacitive load of 5 fF at [...] mV in the CMOS 65 nm standard-$V_T$ process. After balancing, the average propagation delay of the inverter is reduced from 14 ns to 10 ns. This speed improvement arises because both the pMOS and the nMOS transistors are forward-biased when the balancer is turned on. Most importantly, the standard deviation is reduced by [...] and the [...] is reduced by [...] when the proposed configurable balancer is used, as an exceedingly long rising/falling time is avoided.
Fig. 3. (a) Simulated 3σ range of [...]. (b) Propagation delay for an inverter in 65 nm CMOS from Monte Carlo simulation ($W_p/W_n$ = 1.1 µm/0.40 µm, $C$ = 5 fF).

B. Improving Driving Capability by Exploiting Parallel $V_T$ Mismatch in Power Switches
Even though $V_T$ mismatch is known to be catastrophic for circuit functionality, we have developed an approach that improves sub/near threshold current drivability by exploiting the $V_T$ mismatch between parallel transistors. Our approach is based on a theoretical proof and on simulation results showing that, in the subthreshold regime, the $V_T$ mismatch between parallelized transistors always results in an increased mean driving current. This property has been applied to the power switches of the $V_T$ balancer circuit.

Fig. 4. (a) nMOS transistor with aspect ratio (W, L); (b) N-parallelized nMOS transistors with aspect ratio (W/N, L).

Suppose $\mu_{V_T}$ and $\sigma_{V_T}$ are the mean and standard deviation of the $V_T$ of an nMOS transistor as shown in Fig. 4(a). Considering the intra-die variation of a single transistor modeled as in [22], we have
$$\sigma_{V_T} = \frac{A_{V_T}}{\sqrt{WL}} \qquad (1)$$
where $A_{V_T}$ is a technology conversion constant (in mV·µm), and WL is the transistor’s active area. Since $V_T$ follows a normal distribution, the transistor’s on-current follows a log-normal distribution in sub-threshold. Using the properties of a log-normal distribution, the mean value and standard deviation of the on-current are as shown in (2) and (3), where $V_{gs}$ is the gate-source voltage and the remaining parameters are the intrinsic thermal voltage and the junction gradient coefficient.
(2)
(3)
Suppose the transistor is equally divided into N parallel nMOS transistors [see Fig. 4(b)]. Without loss of generality, let us denote the mean and standard deviation of the threshold voltage of any of these parallel transistors as
(4)
(5)
where
(6)
Then, the mean value of the total subthreshold current in Fig. 4(b) is obtained as shown in (7) and (8).
(7)
(8)
Comparing (1) and (6), and since $N > 1$, we have that
(9)
Then, by comparing (2) and (7), we obtain
(10)
As can be seen, dividing a large transistor into smaller parallelized transistors helps to increase the subthreshold current due to the larger $V_T$ mismatch. We also ran Monte Carlo simulations to confirm the effectiveness of this approach. As a reference, assume an nMOS transistor with aspect ratio [...], divided into N transistors, with its gate voltage $V_{gs}$ and drain-to-source voltage $V_{ds}$ set at 200 mV. A 200 mV $V_{gs}$ and $V_{ds}$ is chosen because, in the $V_T$ balancer, the $V_{gs}$ and $V_{ds}$ of the power switches operating in the subthreshold regime are approximately 200 mV (half of the 400 mV $V_{DD}$). Since the power switches’ output will forward bias the bulk of the p/n transistors in the digital blocks, an output voltage close to 200 mV is the right magnitude: it can bring the $V_T$ unbalance from [...] deviation back to the typical value without incurring too much excess leakage current. The simulated mean and standard deviation values of the effective driving current are listed in Table II. As seen, the larger the number of segments N, the larger the $V_T$ mismatch and, consequently, the larger the mean subthreshold driving current. However, Table II also shows an increasing driving current variability and a larger [...] as the transistor becomes narrower. According to (8), this is due to an increased $V_T$ shift caused by narrow width effects. To mitigate this effect, instead of dividing all transistors into minimal width transistors, our design constrains the transistor width to be no smaller than a certain limit. By constraining [...] to a maximum of 20%, the same driving current can be achieved with approximately 10% transistor area reduction. In addition, the multi-finger layout avoids an extreme aspect ratio and fits easily into the layout of the other devices, hence making the entire layout more compact.
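The direction of the inequality can be checked with a short derivation. The sketch below assumes the common exponential subthreshold model $I = I_0 \exp((V_{gs} - V_T)/(n\phi_t))$ with a prefactor $I_0$ proportional to width, independent and identically distributed $V_T$ across the segments, and a mismatch that scales as $1/\sqrt{WL}$. The symbols $I_0$, $n$, $\phi_t$, $\mu$ and $\sigma$ are notation chosen here for illustration and need not match the paper's equations (2) to (10). Narrow-width shifts of the mean $V_T$ are ignored.

```latex
\begin{align*}
% Single transistor of size (W, L), with V_T ~ N(\mu, \sigma^2):
\mathbb{E}[I_1] &= I_0 \exp\!\left(\frac{V_{gs}-\mu}{n\phi_t}\right)
                   \exp\!\left(\frac{\sigma^2}{2 n^2 \phi_t^2}\right) \\
% Each of the N segments has size (W/N, L), so its mismatch grows:
\sigma_{\mathrm{seg}} &= \sqrt{N}\,\sigma \\
% Sum of N segment currents, each with prefactor I_0 / N:
\mathbb{E}[I_N] &= \sum_{i=1}^{N} \frac{I_0}{N}
                   \exp\!\left(\frac{V_{gs}-\mu}{n\phi_t}\right)
                   \exp\!\left(\frac{N\sigma^2}{2 n^2 \phi_t^2}\right)
                 = \mathbb{E}[I_1]\,
                   \exp\!\left(\frac{(N-1)\,\sigma^2}{2 n^2 \phi_t^2}\right)
                 \;\ge\; \mathbb{E}[I_1]
\end{align*}
```

Under these assumptions the mean current grows with N for the same total width, because the mean of a log-normal variable increases with the variance of the underlying normal variable.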
TABLE II
MEAN AND STANDARD DEVIATION OF DRIVING CURRENT
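In the spirit of Table II, the trend can be reproduced qualitatively with a small Monte Carlo experiment. The sketch below is not the authors' simulation setup: it assumes an idealized exponential subthreshold model, independent Gaussian $V_T$ per segment with a standard deviation that grows as the square root of the number of segments, and placeholder constants (`N_PHI_T`, `SIGMA_SINGLE`, `MU_VT`) chosen only for illustration. Currents are in arbitrary normalized units.

```python
import math
import random
import statistics

# All constants below are assumptions for illustration, not values from the paper.
N_PHI_T = 0.039        # slope factor times thermal voltage, in volts
SIGMA_SINGLE = 0.010   # V_T standard deviation of the undivided transistor, in volts
MU_VT = 0.40           # mean threshold voltage, in volts
VGS = 0.20             # gate-source voltage, in volts (200 mV as in the text)


def total_current(n_segments: int, rng: random.Random) -> float:
    """Sum the subthreshold currents of n_segments parallel fingers."""
    # Each finger has 1/n of the width, so its V_T sigma grows by sqrt(n).
    sigma_seg = SIGMA_SINGLE * math.sqrt(n_segments)
    total = 0.0
    for _ in range(n_segments):
        vt = rng.gauss(MU_VT, sigma_seg)
        # The width-proportional prefactor is split evenly across the fingers.
        total += math.exp((VGS - vt) / N_PHI_T) / n_segments
    return total


def simulate(n_segments: int, trials: int = 20000, seed: int = 1):
    """Return (mean, standard deviation) of the total current over many trials."""
    rng = random.Random(seed)
    samples = [total_current(n_segments, rng) for _ in range(trials)]
    return statistics.fmean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        mean, std = simulate(n)
        print(f"N={n}: mean={mean:.3e}, std={std:.3e}, std/mean={std / mean:.3f}")
```

With these placeholder values the mean current rises as the number of segments grows, in line with the trend the text attributes to Table II. The absolute numbers carry no meaning for the actual 65 nm process.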
C. Sub-Threshold Library Selection
The standard library cells optimized for super-threshold design must be revised for reliable logic synthesis. Cells with a large effective driving current variability will have a remarkably low yield. We identified these cells through Monte Carlo simulations and filtered them out before logic synthesis. The metric we used is that, after applying $V_T$ balancing, the cells that have [...] 20% at [...] 400 mV are eliminated, where [...] is the leakage current for off-transistors. These cells have some typical structures:
1) More Than Four Parallel Transistors and More Than Four Stacked Transistors: The standard cells are composed of narrow transistors to increase area efficiency. As the number of parallel transistors and the number of stacked transistors increases, the leakage current variability increases dramatically, as shown in Section II-B. We simply discarded logic gates with more than four parallel transistors or more than four stacked transistors, such as 4-input NAND and NOR gates.
2) Ratioed Logic: Ratioed logic can reduce the number of transistors required to implement a given logic function, but it must be sized carefully to guarantee that the active current is stronger than the static current. Therefore, the correct functioning of ratioed logic cells depends largely on the sizing. In the subthreshold region, the largest current variability is due to $V_T$ variation. Even a small variation in $V_T$ has a heavy impact on the active or static current. Therefore, logic cells that rely entirely on transistor sizing are dangerous and should be avoided.
3) Feedback Logic: Feedback logic is a special type of ratioed logic which uses positive feedback loops to help change the logic values. Due to $V_T$ variation, the output of the logic can have stuck-high or stuck-low failures and thus never flip.
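The screening rules above can be sketched as a simple filter over a cell list. Everything in this example is hypothetical: the `Cell` fields, the cell names and the variability numbers are made up for illustration, and the variability check assumes the metric is a ratio that must not exceed 20%, which is one reading of the criterion stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical summary of one library cell after Monte Carlo analysis."""
    name: str
    max_parallel: int    # largest number of parallel transistors in a network
    max_stacked: int     # largest number of stacked (series) transistors
    ratioed: bool        # correct operation depends on relative sizing
    feedback: bool       # uses a positive feedback loop
    variability: float   # assumed drive-current variability ratio (0.20 = 20%)


def is_reliable(cell: Cell, max_variability: float = 0.20, max_devices: int = 4) -> bool:
    """Apply the structural and variability rules described in the text."""
    if cell.max_parallel > max_devices or cell.max_stacked > max_devices:
        return False
    if cell.ratioed or cell.feedback:
        return False
    return cell.variability <= max_variability


# Made-up example library; names and numbers are placeholders.
LIBRARY = [
    Cell("INV_X1", 1, 1, False, False, 0.12),
    Cell("NAND2_X1", 2, 2, False, False, 0.15),
    Cell("NOR6_X1", 6, 6, False, False, 0.18),
    Cell("PSEUDO_NMOS_OR", 3, 1, True, False, 0.10),
    Cell("KEEPER_BUF", 1, 1, False, True, 0.11),
    Cell("AOI22_X1", 2, 2, False, False, 0.27),
]

kept = [cell.name for cell in LIBRARY if is_reliable(cell)]
print("cells kept for synthesis:", kept)
```

A filter like this applies to combinational cells. Latches and registers are feedback logic that cannot simply be discarded, so they need separate treatment.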
D. Turning Ratioed Logic Into Non-Ratioed Logic
Latches and registers are the feedback logic that must be used in sequential circuits. To reduce loading on the clock net and ease ultrahigh speed designs, some latches/registers use weak but always-on feedback inverters. Fig. 5 shows how to turn them into non-ratioed logic. By using the clk signal and its complement, we prevent the slave inverters from directly cross-coupling with the master inverters. As a result, when writing into the latch, the slave inverter is always disabled, so the writing to the master inverter is facilitated. After the writing is done, the slave inverter is enabled to help maintain the logic value. Therefore, the race

References
The JPEG still picture compression standard
Matching properties of MOS transistors
A 180-mV subthreshold FFT processor using a minimum energy design methodology