Journal Article

124 MSamples/s Pixel-Pipelined Motion-JPEG 2000 Codec Without Tile Memory

01 Apr 2007-IEEE Transactions on Circuits and Systems for Video Technology (IEEE)-Vol. 17, Iss: 4, pp 398-406

TL;DR: A level-switched DWT (LS-DWT) and a code-block switched EBC (CS-EBC) are developed to enable a level-switched scheduling that eliminates tile memory and pipelines the DWT and the EBC at pixel level.
Abstract: A 124 MSamples/s JPEG 2000 codec is implemented on a 20.1 mm² die with 0.18 µm CMOS technology, dissipating 385 mW at 1.8 V and 42 MHz. This chip is capable of processing 1920×1080 HD video at 30 fps. In previous works, tile-level pipeline scheduling is used between the discrete wavelet transform (DWT) and embedded block coding (EBC). For a tile of size 256×256, this costs 175 kB of on-chip SRAM for architectures using on-chip tile memory, or 310 MB/s of SDRAM bandwidth for architectures using off-chip tile memory. In this design, a level-switched scheduling is developed to eliminate tile memory, and the DWT and the EBC are pipelined at pixel level. This scheduling eliminates the 175 kB on-chip SRAM and the 310 MB/s off-chip SDRAM bandwidth. The level-switched DWT (LS-DWT) and the code-block switched EBC (CS-EBC) are developed to enable this scheduling. The codec functions are realized on unified hardware, and hardware sharing between encoder and decoder reduces silicon area by 40%.
Topics: Codec (54%), JPEG 2000 (53%), Static random-access memory (52%), Motion JPEG (50%)

Summary

Introduction

  • Index Terms—HD video, image compression, JPEG 2000.
  • There are three challenges in the design of an efficient JPEG 2000 codec for HD video.
  • First, the large data rate between the DWT and the EBC requires either large on-chip SRAM or high SDRAM bandwidth.
  • For the conventional architectures, tile-level pipeline scheduling, i.e., DWT and EBC pipelined at tile-level, is used due to two critical problems.
  • The encoding and decoding functions are implemented on unified hardware with little overhead for the control circuits.

II. JPEG 2000 OVERVIEW

  • The image is partitioned into tiles, which are independently coded.
  • Each tile is decomposed by the DWT into subbands with certain decomposition levels.
  • Seven subbands are generated with two decomposition levels.
  • Each subband is further partitioned into code-blocks, and each code-block is independently encoded by the EBC.

A. Discrete Wavelet Transform

  • In JPEG 2000, each tile is transformed by a multi-level, two-dimensional (2-D) DWT.
  • Note that $LL_0$ denotes the original tile and the numbered circles denote the output order of each coefficient in each subband.
  • In each level, a 2-D DWT can be factorized into two one-dimensional (1-D) DWTs.
  • The HL (LH) band is obtained by high-pass filtering in the horizontal (vertical) direction and low-pass filtering in the vertical (horizontal) direction.
  • For the inverse DWT, the procedure is the reverse of that for the forward DWT, i.e., $LL_{n+1}$, $LH_{n+1}$, $HL_{n+1}$, and $HH_{n+1}$ compose the $LL_n$ subband.

B. Embedded Block Coding

  • The Embedded Block Coding (EBC) algorithm contains context formation and context-adaptive arithmetic coder.
  • Fig. 4(a) and (b) shows the block diagram of the EBC algorithms for the encoder and decoder, respectively.
  • For the EBC in the encoder, the context formation generates a pair of context and decision bit, and the context-adaptive arithmetic encoder generates embedded bit streams.
  • The DWT coefficients in a code-block are sign-magnitude represented, and are encoded or decoded from the Most Significant Bit (MSB) bit-plane to the Least Significant Bit (LSB) bit-plane.
  • Each bit-plane is scanned by three coding passes, Pass 1 (significance propagation pass), Pass 2 (magnitude refinement pass), and Pass 3 (clean-up pass).

III. LEVEL-SWITCHED SCHEDULING

  • There are two critical problems in designing an efficient and high-throughput JPEG 2000 system.
  • By use of this scheduling, the tile memory can be eliminated at a cost of a little additional memory buffer for the DWT and the EBC.
  • The arithmetic coder is terminated at end of each coding pass and the samples that come from the next stripe are considered insignificant.
  • The dataflow of the DWT is designed to match the stripe scan of the EBC, and the EBC is designed to be capable of processing one coefficient per cycle to match the word-level throughput of the DWT.
  • The DWT and the EBC switch to level 2 decomposition to process computation state 8 as soon as the buffered coefficients are enough for a computation state.

IV. JPEG 2000 CODEC ARCHITECTURE

  • Fig. 7 shows the block diagram of the codec.
  • It contains a main controller, a 3-level DWT module, three embedded block coding (EBC) modules, a rate-distortion optimization (RDO) controller, and a bit stream controller (BSC).
  • Both the DWT and the EBC are pixel-pipelined such that no tile memory is required between the DWT and the EBC.
  • Moreover, both the encoding and the decoding are one-pass, that is, no coefficients are transferred to or from SDRAM.
  • The detailed architectures are elaborated in the following sections.

A. Level-Switched DWT Architecture

  • It contains two 1-D DWT modules, an LL-band buffer, and an inter-level line buffer.
  • The direction of arrows shows the dataflow of the forward transformation.
  • The DWT architecture in [6] uses a line-buffer to buffer the partially transformed coefficients [14] to avoid multiple accesses for the coefficients in the column direction and uses nonoverlapped stripe scan to eliminate the line-buffer in the row direction.
  • By using line-based architecture, only one read for each pixel is required, which is the theoretical lower bound.
  • To reduce the memory requirement for the LL-band buffer, 8, 16, and 8 lines are used to buffer $LL_1$, $LL_2$, and $LL_3$ coefficients, respectively.

B. Code-Block Switched EBC Architecture

  • The EBC is the throughput bottleneck of a high-performance JPEG 2000 codec.
  • Therefore, the probability state memory of the EBC requires 19950 bits (about 2.4 kB).
  • To ensure that the EBC processes one coefficient per cycle, the four-symbol arithmetic coder (FAC) is designed to be capable of processing all the contexts generated from a bit plane in one cycle.
  • The architecture of the general arithmetic coder is modified from the encoder architecture proposed in [4] to achieve codec function by reconfiguring its datapath.
  • The FAC can operate in one-symbol, two-symbol, or four-symbol mode under multiplexing control.

C. Rate-Distortion Optimization

  • The RDO controller adopts post-compression rate-distortion optimization scheme, which determines truncation points for each code-block at the end of coding a tile according to target bit-rate.
  • The rate and distortion (R-D) for each coding pass of each code-block are accurately calculated.
  • The RDO controller uses an R-D register bank and an R-D memory to buffer the rate and distortion information for the current code-block and each unfinished code-block, respectively.
  • The control scheme for the register bank and memory is the same as that for the PSRB and state memory in CS-EBC.
  • After the finish of the last computation state for the previous tile, the RDO controller determines the truncation points and passes decisions to the BSC.

V. HARDWARE SHARING TECHNIQUES

  • To reduce hardware cost, two hardware sharing techniques are developed to design the codec.
  • First, the level-switched schedulings for the encoder and the decoder have inverse-matched switching characteristics to achieve 100% memory sharing for the LS-DWT and the CS-EBC.
  • Therefore, a large portion of the processing elements such as multipliers and adders can be shared for the forward and backward transformation by using the multiplexed multiplicators.
  • Therefore, reconfigurability can be achieved with little control overhead.
  • By reconfiguring the datapath, an arithmetic coder can save 17K gates compared with separate arithmetic encoder and arithmetic decoder.

A. Chip Implementation and Features

  • The single-chip JPEG 2000 codec is implemented on a 20.1-mm² die using TSMC 0.18-µm CMOS one-poly six-metal (1P6M) technology and was received in September 2005.
  • The die micrograph is shown in Fig. 11 and Table I shows the features of this chip.
  • This prototype only supports tile size 256×256, code-block size 64×64, and three-level decomposition.
  • The authors did not implement other control schemes in this chip.
  • The power consumption is 385 mW at 1.8 V and 42 MHz for lossless encoding and decoding.

B. Testing Result

  • This chip is fully tested by extensive test patterns.
  • The chip works as expected and can correctly encode or decode images.
  • The measured timing versus supply voltage is shown in Fig. 12.
  • By observing Fig. 12, at 1.8 V supply voltage, the chip can work at a frequency higher than 42 MHz and the supply voltage could be scaled down to 1.7 V while maintaining target specification.

C. Effectiveness of the Level-Switched Scheduling

  • Fig. 13 shows the effectiveness of the proposed level-switched scheduling on the reduction of memory requirements and external memory bandwidth.
  • Therefore, the DWT and the EBC are pipelined at tile-level by using off-chip tile memory.
  • The 5.7 kB memory includes the line buffer for one level used in the DWT, state memory, and line buffer for one code-block used in the EBC as well as other usages such as bit-streams buffer and rate-distortion buffer for the RDO.
  • The SDRAM bandwidth can be reduced to 37% of the original one by embedding the tile memory.
  • The on-chip memory is so large that it dramatically increases the silicon cost.

D. Effectiveness of Hardware Sharing

  • Fig. 14 shows the effectiveness of cost reduction by using two hardware sharing techniques.
  • It shows the logic gate counts and the memory requirement to implement an encoder, a decoder, and a codec.
  • With sharing techniques, the logic gate count of the DWT in the codec is about 50% larger than that of the encoder or decoder.
  • The EBC in the decoder has a larger logic gate count because its PCF module is much more complex than that in the encoder.
  • The shared BSC between encoder and decoder also saves 121K logic gates.

E. Comparison

  • The comparisons with the previous works are summarized in Table III.
  • The works of ADI and Sanyo use off-chip tile memory.
  • The authors use a performance index (PI), defined as throughput per unit area at 1 MHz, to make a comparison for the existing works.
  • Moreover, the SDRAM bandwidth of this chip is 280 MB/s less than that of [23] and [24].
  • For [8], the codec functions are implemented on a unified hardware to achieve 60 MS/s and 20 MS/s for encoder and decoder, respectively.

VII. CONCLUSION

  • A JPEG 2000 single chip codec is presented.
  • Both encoding and decoding functions achieve 124 MS/s data rate.
  • The level-switched scheduling reduces 175 kB on-chip memory for the architectures using on-chip tile memory, and 310 MB/s SDRAM bandwidth for the architectures using off-chip tile memory.
  • It matches the dataflow and throughput of the LS-DWT and the CS-EBC to eliminate tile memory.
  • Second, filter core with multiplexed coefficients and reconfigurable arithmetic coder save 489K logic gates.

398 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 17, NO. 4, APRIL 2007
124 MSamples/s Pixel-Pipelined Motion-JPEG 2000
Codec Without Tile Memory
Yu-Wei Chang, Chih-Chi Cheng, Chun-Chia Chen, Hung-Chi Fang, and Liang-Gee Chen, Fellow, IEEE
Abstract—A 124 MSamples/s JPEG 2000 codec is implemented on a 20.1 mm² die with 0.18 µm CMOS technology dissipating 385 mW at 1.8 V and 42 MHz. This chip is capable of processing 1920×1080 HD video at 30 fps. For previous works, the tile-level pipeline scheduling is used between the discrete wavelet transform (DWT) and embedded block coding (EBC). For a tile with size 256×256, it costs 175 kB on-chip SRAM for the architectures using on-chip tile memory or costs 310 MB/s SDRAM bandwidth for the architectures using off-chip tile memory. In this design, a level-switched scheduling is developed to eliminate tile memory and the DWT and the EBC are pipelined at pixel-level. This scheduling eliminates 175 kB on-chip SRAM and 310 MB/s off-chip SDRAM bandwidth. The level-switched DWT (LS-DWT) and the code-block switched EBC (CS-EBC) are developed to enable this scheduling. The codec functions are realized on unified hardware, and hardware sharing between encoder and decoder reduces silicon area by 40%.
Index Terms—HD video, image compression, JPEG 2000.
I. INTRODUCTION
JPEG 2000 [15]–[17], [21], a new still image coding standard, is well known for its excellent coding performance and numerous features [19], such as region of interest (ROI), scalability, and error resilience. All these powerful tools are provided by a unified algorithm in a single JPEG 2000 codestream.
Fig. 1 shows the functional block diagram of the JPEG 2000 encoder. Unlike JPEG [22], JPEG 2000 uses the discrete wavelet transform (DWT) as the transformation algorithm and embedded block coding with optimized truncation (EBCOT) as the entropy-coding algorithm. EBCOT is a two-tiered algorithm. Tier-1 is the embedded block coding (EBC), which uses a context-adaptive arithmetic coder, and tier-2 is post-compression rate-distortion optimization, which provides optimal image quality at a target bit rate. By use of the above new coding tools, JPEG 2000 outperforms JPEG by more than 2 dB in general [19]. However, the complexity of JPEG 2000 is much higher than that of JPEG.
Manuscript received June 12, 2006; revised September 17, 2006. This work was supported in part by the National Science Council, Republic of China, under Grant 95-2752-E-002-008-PAE, and in part by a MediaTek Fellowship. This paper was recommended by Associate Editor C. N. Taylor.
Y.-W. Chang, C.-C. Cheng, and L.-G. Chen are with the DSP/IC Design Laboratory, Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan, R.O.C. (e-mail: wayne@video.ee.ntu.edu.tw; ccc@video.ee.ntu.edu.tw; lgchen@video.ee.ntu.edu.tw).
C.-C. Chen and H.-C. Fang were with the Department of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan, R.O.C. They are now with MediaTek Inc., Hsinchu 300, Taiwan, R.O.C. (e-mail: chunchia@video.ee.ntu.edu.tw; honchi@video.ee.ntu.edu.tw).
Digital Object Identifier 10.1109/TCSVT.2006.888819
Several JPEG 2000 codec designs have been reported in the literature [1], [8], [23], [24]. However, they suffer from either high operating frequency or large chip area. Amphion’s codec [8] operates at a frequency higher than 150 MHz to provide a throughput of 60 MSamples/s (MS/s) and 20 MS/s for the encoder and decoder, respectively. The design of [1] occupies 144 mm² to achieve about 50 MS/s throughput. Sanyo [23], [24] developed an efficient JPEG 2000 codec architecture, which compromises between the throughput and the silicon area while keeping the operating frequency as low as 54 MHz. However, the SDRAM bandwidth requirement is so high that two buses are needed, and the operating frequency of each bus is two times that of the core.
There are three challenges in the design of an efficient JPEG 2000 codec for HD video. First, the large data rate between the DWT and the EBC requires either large on-chip SRAM or high SDRAM bandwidth. Second, the complicated control and irregular dataflow of the DWT and the EBC cost large area to meet the high throughput requirement. Third, hardware sharing between the encoder and the decoder is difficult due to different computation characteristics and dataflow. All of the above lead to high operating frequency, huge memory size, and high memory bandwidth for the chip implementation of a high-throughput JPEG 2000 codec.
For the conventional architectures, tile-level pipeline scheduling, i.e., DWT and EBC pipelined at tile-level, is used due to two critical problems. First, the dataflow patterns of the DWT and the EBC are quite different; the DWT generates the coefficients in a subband-interleaving manner while the EBC encodes or decodes a code-block within one subband at a time. Second, the DWT is a word-level algorithm while the EBC is a bit-level one. Therefore, a tile memory is usually used for transferring coefficients between the DWT and the EBC. Tile-level pipeline scheduling introduces either high bandwidth for those architectures storing tiles in off-chip memory [23], [24] or high cost for those architectures storing tiles in on-chip memory.
In this work [5], we propose a level-switched scheduling to solve the above two problems. For a tile size of 256×256, it eliminates 175 kB of SRAM tile memory for those architectures using on-chip tile memory and saves 310 MB/s of memory bandwidth for those architectures using off-chip tile memory. By use of this scheduling, the coefficients between the DWT and the EBC are transferred with a pixel-pipelined dataflow due to the elimination of tile memory. In this dataflow, no buffer is required between the DWT and the EBC. The coefficients generated by the DWT are encoded by the EBC immediately for the encoding flow, or the coefficients decoded by the EBC are inverse-transformed immediately by the DWT for the decoding flow. To enable this scheduling, a level-switched DWT (LS-DWT) and a code-block switched EBC (CS-EBC) are developed. The LS-DWT and the CS-EBC process multiple code-blocks in multiple subbands in an interleaving manner to eliminate the tile memory. The encoding and decoding functions are implemented on unified hardware with little overhead for the control circuits. By use of the above techniques, the codec chip capable of processing 1920×1080 HD 4:2:2 video format at 30 frames per second (fps) is realized on a 20.1 mm² die with 0.18 µm CMOS technology dissipating 385 mW at 1.8 V and 42 MHz. Hardware sharing between encoder and decoder reduces silicon costs by 40%.
1051-8215/$25.00 © 2007 IEEE
Fig. 1. Functional block diagram of the JPEG 2000 encoder. JPEG 2000 adopts DWT as transform algorithm and EBC as entropy coding algorithm. The RDO maximizes image quality at a target bit rate.
Fig. 2. Data hierarchy in JPEG 2000. An image is decomposed into tiles, sub-bands, code-blocks, bit-planes, and coding passes.
The organization of this paper is as follows. Section II gives some background information about JPEG 2000. Section III describes the proposed level-switched scheduling and Section IV shows the developed architectures. Implementation results and comparisons with the previous works are shown in Section VI. Finally, Section VII concludes this paper.
II. JPEG 2000 OVERVIEW
In JPEG 2000, an image is decomposed into various abstract levels for coding, as shown in Fig. 2. The image is partitioned into tiles, which are independently coded. Each tile is decomposed by the DWT into subbands with certain decomposition levels. For example, seven subbands are generated with two decomposition levels. Each subband is further partitioned into code-blocks, and each code-block is independently encoded by the EBC.
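As a quick check of the subband count quoted above: each decomposition level contributes three detail subbands (HL, LH, HH), and the last level also leaves one LL subband, so an L-level decomposition yields 3L + 1 subbands. The sketch below enumerates them; the function name `subband_names` and the naming scheme are illustrative assumptions, not taken from the paper.

```python
def subband_names(levels):
    """List the subbands of an L-level dyadic DWT (names are illustrative)."""
    names = []
    for n in range(1, levels + 1):
        # Every level produces three detail subbands.
        names += [f"HL{n}", f"LH{n}", f"HH{n}"]
    # Only the LL subband of the deepest level remains in the output.
    names.append(f"LL{levels}")
    return names


print(subband_names(2))       # two levels, as in the example in the text
print(len(subband_names(3)))  # three levels, as supported by the prototype
```

With two levels the list has seven entries, matching the example in the text.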
A. Discrete Wavelet Transform
In JPEG 2000, each tile is transformed by a multi-level, two-dimensional (2-D) DWT. For the forward DWT, an $n$th LL subband ($LL_n$) is decomposed into four subbands $LL_{n+1}$, $LH_{n+1}$, $HL_{n+1}$, and $HH_{n+1}$. Fig. 3 shows an example in which an 8×8 tile is decomposed into four subbands. Note that $LL_0$ denotes the original tile and the numbered circles denote the output order of each coefficient in each subband. As can be seen, the generated coefficients are output in an interleaved manner across the four subbands. In each level, a 2-D DWT can be factorized into two one-dimensional (1-D) DWTs. The 2-D DWT is achieved by applying the vertical 1-D DWT first, followed by the horizontal 1-D DWT. The LL band is obtained by low-pass filtering in both horizontal and vertical directions and the HH band is obtained by high-pass filtering in both directions. The HL (LH) band is obtained by high-pass filtering in the horizontal (vertical) direction and low-pass filtering in the vertical (horizontal) direction. For the inverse DWT, the procedure is the reverse of the procedure for the forward DWT, i.e., $LL_{n+1}$, $LH_{n+1}$, $HL_{n+1}$, and $HH_{n+1}$ compose the $LL_n$ subband.
Fig. 3. An 8×8 tile is decomposed into four subbands. Each numbered circle represents the output order of each coefficient in each subband.
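The factorization of one 2-D level into a vertical and then a horizontal 1-D pass can be sketched in a few lines. The example below uses the reversible 5/3 lifting filter of JPEG 2000 purely for brevity (it is a stand-in chosen for this illustration, not the filter core of the chip); the function names `lift_53` and `dwt2d_one_level` are hypothetical, and even tile dimensions are assumed.

```python
def lift_53(x):
    """One level of reversible 5/3 lifting on an even-length list.

    Returns (low, high); edges use whole-sample symmetric extension.
    """
    n = len(x)
    half = n // 2
    high = []
    for i in range(half):
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]  # mirror at right edge
        high.append(x[2 * i + 1] - (x[2 * i] + right) // 2)
    low = []
    for i in range(half):
        left = high[i - 1] if i > 0 else high[0]             # mirror at left edge
        low.append(x[2 * i] + (left + high[i] + 2) // 4)
    return low, high


def dwt2d_one_level(tile):
    """Vertical 1-D pass first, then horizontal, as described in the text."""
    rows, cols = len(tile), len(tile[0])
    col_low, col_high = [], []
    for c in range(cols):                                    # filter each column
        lo, hi = lift_53([tile[r][c] for r in range(rows)])
        col_low.append(lo)
        col_high.append(hi)
    # Back to row-major: L holds vertically low-passed rows, H the high-passed ones.
    L = [[col_low[c][r] for c in range(cols)] for r in range(rows // 2)]
    H = [[col_high[c][r] for c in range(cols)] for r in range(rows // 2)]
    LL, HL, LH, HH = [], [], [], []
    for row in L:                                            # filter each row
        lo, hi = lift_53(row)
        LL.append(lo)   # low horizontal, low vertical
        HL.append(hi)   # high horizontal, low vertical
    for row in H:
        lo, hi = lift_53(row)
        LH.append(lo)   # low horizontal, high vertical
        HH.append(hi)   # high horizontal, high vertical
    return LL, HL, LH, HH


# A tile with a single vertical edge: only horizontal detail is expected.
tile = [[0, 0, 0, 100, 100, 100, 100, 100] for _ in range(8)]
LL, HL, LH, HH = dwt2d_one_level(tile)
print(HL[0], LH[0], HH[0])
```

Because every column of this test tile is constant, the vertical high-pass output is zero, so LH and HH vanish and the edge shows up only in HL, which matches the band definitions given above.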
B. Embedded Block Coding
The Embedded Block Coding (EBC) algorithm contains context formation and a context-adaptive arithmetic coder. Fig. 4(a) and (b) shows the block diagram of the EBC algorithms for the encoder and decoder, respectively. For the EBC in the encoder, the context formation generates a pair of context and decision bit, and the context-adaptive arithmetic encoder generates embedded bit streams. For the EBC in the decoder, the context formation generates context, and the context-adaptive arithmetic decoder receives it and embedded bit streams to decode the decision bit.
As shown in Fig. 2, the basic coding unit for the EBC is a code-block. The DWT coefficients in a code-block are sign-magnitude represented, and are encoded or decoded from the Most Significant Bit (MSB) bit-plane to the Least Significant Bit (LSB) bit-plane. Each bit-plane is scanned by three coding passes, Pass 1 (significance propagation pass), Pass 2 (magnitude refinement pass), and Pass 3 (clean-up pass). For each coding pass, a special coding order, called stripe scan, is used to scan a bit-plane. A stripe has the size of $4 \times N$, where $N$ is the width of a code-block. Fig. 5 shows the stripe scan. A bit-plane is scanned stripe by stripe and column by column from left to right in a stripe.
Fig. 4. (a) Block diagram of the EBC algorithm for the encoder. The context formation generates a pair of context and decision bit, and the context-adaptive arithmetic encoder generates embedded bit streams. (b) Block diagram of the EBC algorithm for the decoder. The context formation generates context, and the context-adaptive arithmetic decoder receives it and embedded bit streams to decode the decision bit.
Fig. 5. A spatial scan order, called stripe scan, is used to scan a bit-plane. A stripe has the size of $4 \times N$. A bit-plane is scanned stripe by stripe and column by column from left to right in a stripe.
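The stripe scan can be written as three nested loops. The sketch below is only an illustration: the generator name `stripe_scan` is hypothetical, and the top-to-bottom order inside each four-sample column follows the usual JPEG 2000 convention, which the text above does not spell out.

```python
def stripe_scan(height, width, stripe_height=4):
    """Yield (row, col) positions of one bit-plane in stripe-scan order."""
    for top in range(0, height, stripe_height):          # stripe by stripe
        for col in range(width):                         # columns, left to right
            for row in range(top, min(top + stripe_height, height)):
                yield (row, col)                         # down the 4-sample column


order = list(stripe_scan(8, 2))
print(order[:6])
```

For an 8×2 block the scan visits the four samples of column 0 in the first stripe, then column 1, and only then moves to the second stripe.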
III. LEVEL-SWITCHED SCHEDULING
There are two critical problems in designing an efficient and high-throughput JPEG 2000 system. The first one is the dataflow mismatch between the EBC and the DWT. The output/input dataflow of the DWT and the input/output scan order of the EBC are different in the encoder/decoder. The dataflow of DWT coefficients interleaves in four subbands while the EBC processes a code-block within a subband. Besides, the scan order of the EBC is stripe scan, which is different from the scan order of the DWT, which is column by column and row by row in a subband. The dataflow mismatch introduces a large temporal buffer for the dataflow conversion between the DWT and the EBC in the codec system. The second problem is the throughput mismatch between the DWT and the EBC. The DWT is a word-level processing algorithm while the EBC is a bit-plane sequential processing algorithm. In the encoder, the DWT coefficients of a code-block should be buffered since the EBC processes one bit-plane at a time. Therefore, not only is the EBC the throughput bottleneck of the entire system, but the multiple memory accesses for DWT coefficients due to the EBC’s sequential property also waste power. Due to the two critical problems, the previous architectures [23], [12], [24] use tile-level pipeline scheduling, i.e., the EBC processes the current tile while the DWT processes the next tile in the encoder system by using tile memory. For the target specification of 1920×1080 4:2:2 30 fps and tile size 256×256, it costs 175 kB of memory for storing the 10-bit DWT coefficients of two tiles and 310 MB/s of coefficient transmission between the DWT and the EBC. Note that the 310 MB/s only contains the amount of data transmitted between the DWT and the EBC through tile memory. The bandwidth requirement for multi-level transformation of the DWT is not included since it depends on which DWT architecture is adopted.
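The 124 MS/s and 310 MB/s figures can be roughly reproduced from the target specification. The sketch below rests on two assumptions of ours, since the original expressions did not survive in this copy: 4:2:2 sampling is taken as two samples per pixel, and every 10-bit coefficient is assumed to cross the tile memory twice (one write by the DWT, one read by the EBC).

```python
WIDTH, HEIGHT, FPS = 1920, 1080, 30
SAMPLES_PER_PIXEL = 2   # assumption: 4:2:2 means Y plus half-rate Cb and Cr
COEFF_BITS = 10         # 10-bit DWT coefficients, as stated in the text
ACCESSES = 2            # assumption: one write plus one read through tile memory

samples_per_s = WIDTH * HEIGHT * SAMPLES_PER_PIXEL * FPS
bandwidth_bytes_per_s = samples_per_s * COEFF_BITS * ACCESSES / 8

print(f"{samples_per_s / 1e6:.1f} MS/s")
print(f"{bandwidth_bytes_per_s / 1e6:.0f} MB/s")
```

Under these assumptions the sample rate comes out at about 124.4 MS/s and the tile-memory traffic at about 311 MB/s, in line with the rounded values quoted in the text.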
In this section, a level-switched scheduling is proposed to solve the above two mismatch problems. By use of this scheduling, the tile memory can be eliminated at the cost of a little additional memory buffer for the DWT and the EBC. This scheduling eliminates 175 kB of SRAM tile memory for those architectures using on-chip tile memory and saves 310 MB/s of memory bandwidth for those architectures using off-chip tile memory.
To enable this scheduling, the parallel mode must be turned on for the EBC (CAUSAL, RESTART, and RESET are enabled [17]). In this mode, the arithmetic coder is terminated at the end of each coding pass and the samples that come from the next stripe are considered insignificant. As a result of the two restrictions, the image quality of parallel mode is slightly worse than that of the default mode. The average peak signal-to-noise ratio (PSNR) loss is about 0.15 dB for 64×64 code-blocks and 0.35 dB for 32×32 code-blocks at medium bit-rate [20].
The key concept of the level-switched scheduling is to change the operational coding flow in a tile to minimize the memory requirement between the DWT and the EBC. The use of tile memory arises from the dataflow mismatch between the DWT and the EBC. The memory size between the DWT and the EBC is proportional to the data lifetime of the DWT coefficients. If the lifetime for buffering DWT coefficients is shortened, the memory size is also reduced. Therefore, matching the dataflow between the DWT and the EBC is a key to reducing the memory requirement. As described in Section II, the basic coding unit in a code-block for the EBC is a stripe with size $4 \times N$, where $N$ is the code-block width. The processing order for the stripes in a code-block cannot be changed since the order is defined by the standard, but the DWT can change its scan order. Therefore, in the proposed scheduling, the scan order of the DWT is changed to stripe scan to match the scan order of the EBC. Besides, the DWT switches between levels to avoid accumulation of the DWT coefficients due to multi-level decompositions. To co-operate with the DWT, the EBC is designed to be capable of switching between code-blocks.
Fig. 6. Level-switched scheduling for JPEG 2000 encoder. Each rectangle on the left side represents a computation state both for the DWT and the EBC, and the number in it indicates the processing order. Each computation state requires 256 cycles to process either one 64×4 stripe or two 32×4 stripes in each subband.
The detail of the proposed scheduling for the encoding flow is shown in Fig. 6. Each rectangle on the left side represents a computation state both for the DWT and the EBC, and the number in it indicates the processing order. The label of each computation state identifies the range of stripes being processed, the number of the code-block they belong to, and the index of the tile. Each computation state requires 256 cycles to process either one 64×4 stripe or two 32×4 stripes in each subband. The dataflow of the DWT is designed to match the stripe scan of the EBC, and the EBC is designed to be capable of processing one coefficient per cycle to match the word-level throughput of the DWT. Three EBCs are used to process three DWT coefficients in three subbands. Note that the stripes of the final LL subband are processed by one of the three EBCs, while the other two EBCs are idle. The operational sequences for the stripes in Fig. 6 are described as follows. At level 1 decomposition, the DWT generates coefficients in four subbands. The coefficients in the $LH_1$, $HL_1$, and $HH_1$ subbands are processed by the three EBCs immediately while the coefficients in the $LL_1$ subband are buffered for the next level decomposition. The DWT and the EBC switch to level 2 decomposition to process computation state 8 as soon as the buffered coefficients are enough for a computation state. After computation state 8 is finished, the DWT switches back to level 1 to continue the unfinished parts. By use of this scheduling, the buffer between the DWT and the EBC is eliminated by processing stripes in an interleaved manner. To enable this scheduling, the DWT should buffer the unfinished coefficients for each LL band in each level and the EBC should buffer the coding states of each unfinished code-block.
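To make the level-switching idea concrete, the following toy model tracks how many LL coefficients wait in the buffers when the deepest level with a full input block always runs first. It is a sketch under assumptions, not the controller of the chip: the block sizes (1024 inputs and 256 outputs per subband per computation state) are taken from the LS-DWT description in Section IV-A, the greedy policy and the name `level_switched_schedule` are ours, the states spent coding the final LL subband are ignored, and the resulting state numbering need not match Fig. 6.

```python
def level_switched_schedule(tile_size=256, levels=3,
                            in_per_state=1024, out_per_state=256):
    """Toy model of level switching for one tile.

    Returns (order, peak): the decomposition level run in each computation
    state, and the peak number of buffered LL coefficients per level.
    """
    pending_level1 = tile_size * tile_size // in_per_state
    buffered = {lvl: 0 for lvl in range(1, levels)}   # LL_1 .. LL_(levels-1)
    peak = dict(buffered)
    order = []

    def next_level():
        # Prefer the deepest level whose LL input buffer holds a full block.
        for lvl in range(levels, 1, -1):
            if buffered[lvl - 1] >= in_per_state:
                return lvl
        return 1 if pending_level1 > 0 else None

    while True:
        lvl = next_level()
        if lvl is None:
            break
        if lvl == 1:
            pending_level1 -= 1                 # consume a block of the tile
        else:
            buffered[lvl - 1] -= in_per_state   # consume buffered LL input
        if lvl < levels:                        # LL output feeds the next level
            buffered[lvl] += out_per_state
            peak[lvl] = max(peak[lvl], buffered[lvl])
        order.append(lvl)
    return order, peak


order, peak = level_switched_schedule()
print(order[:10])
print(peak, "versus", 256 * 256, "coefficients in a whole tile")
```

In this model no LL buffer ever holds more than one input block, which illustrates why shortening the lifetime of the coefficients removes the need for a tile-sized memory.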
Fig. 7. Block diagram of the JPEG 2000 codec system. It contains a main controller, a 3-level DWT module, three embedded block coding (EBC) modules, a rate-distortion optimization (RDO) controller, and a bit stream controller (BSC).
For the decoding flow, the operational sequences for the stripes are opposite to those for the encoding flow. At the beginning of decoding a tile, one of the three EBCs decodes the coefficients in the $LL_2$ subband and these coefficients are buffered. The DWT and the EBC switch to level 2 when enough coefficients are buffered. At level 2, the three EBCs decode the coefficients in the $LH_2$, $HL_2$, and $HH_2$ subbands, and the DWT composes the coefficients in the four subbands to generate the coefficients in $LL_1$. Note that the numbers of buffered coefficients for each LL band in each level are the same as those in the encoding scheduling. The additional buffer that enables the encoding scheduling can be fully shared for the decoding scheduling.
IV. JPEG 2000 CODEC ARCHITECTURE
Fig. 7 shows the block diagram of the codec. It contains a main controller, a 3-level DWT module, three embedded block coding (EBC) modules, a rate-distortion optimization (RDO) controller, and a bit stream controller (BSC). The RDO controller maximizes image quality at a given target bit rate. Both the DWT and the EBC are pixel-pipelined such that no tile memory is required between the DWT and the EBC. Moreover, both the encoding and the decoding are one-pass, that is, no coefficients are transferred to or from SDRAM.
To enable the level-switched scheduling, the level-switched DWT (LS-DWT) and the code-block switched EBC (CS-EBC) are developed. The detailed architectures are elaborated in the following sections.
A. Level-Switched DWT Architecture
Fig. 8 shows the architecture of LS-DWT. It contains two
1-D DWT modules, an LL-band buffer, and an inter-level line
buffer. The direction of the arrows shows the dataflow of the forward transformation. The LS-DWT is designed to process four coefficients per cycle. For a computation state of 256 cycles in the level-switched scheduling, this architecture can decompose a 128 × 8 or 64 × 16 block and generate 64 × 4 or 32 × 8 coefficients, respectively, in each subband (LL, LH, HL, and HH). The coefficients in the LH, HL, and HH subbands are encoded by the CS-EBC as soon as they are generated, so no memory is required to buffer them, while the coefficients in the LL band are stored in the LL-band buffer for the next level of decomposition. The DWT switches to the next decomposition level as soon as the LL-band buffer holds enough data for a computation state.

Fig. 8. Dataflow of the LS-DWT for forward transformation. It decomposes a 128 × 8 block and generates 64 × 4 coefficients in each subband. The coefficients in LH, HL, and HH are encoded by the CS-EBC as soon as they are generated, while the coefficients in LL are stored in the LL-band buffer for the next level of decomposition.
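The computation-state bookkeeping described for the LS-DWT can be checked with a short sketch. The figures of four coefficients per cycle and 256 cycles per state come from the text; the greedy rule "run the deepest level whose LL input is ready", the coefficient-count model of the LL-band buffer (the real buffer is organized in lines), and the names `subband_block` and `schedule` are illustrative assumptions, not the authors' controller.

```python
CYCLES_PER_STATE = 256   # from the text
COEFFS_PER_CYCLE = 4     # from the text
COEFFS_PER_STATE = CYCLES_PER_STATE * COEFFS_PER_CYCLE  # 1024 inputs per state


def subband_block(width, height):
    """Size of each output subband (LL, LH, HL, HH) for one input block."""
    assert width * height == COEFFS_PER_STATE, "block must fill one state"
    return width // 2, height // 2


def schedule(level1_states, levels=3):
    """Assumed greedy policy: run the deepest level whose LL input is ready."""
    # LL coefficients waiting as input for levels 2..levels (hypothetical model)
    buffered = {lvl: 0 for lvl in range(2, levels + 1)}
    order = []
    remaining = level1_states
    while True:
        ready = [lvl for lvl, n in buffered.items() if n >= COEFFS_PER_STATE]
        if ready:
            lvl = max(ready)
            buffered[lvl] -= COEFFS_PER_STATE
        elif remaining > 0:
            lvl = 1
            remaining -= 1
        else:
            break
        order.append(lvl)
        if lvl < levels:
            # one quarter of a state's outputs are LL inputs for the next level
            buffered[lvl + 1] += COEFFS_PER_STATE // 4
    return order


if __name__ == "__main__":
    print(subband_block(128, 8), subband_block(64, 16))
    print(schedule(16))
```

Under these assumptions, every fourth level-1 state is followed by one level-2 state, simply because each level hands one quarter of its input on as LL data.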
The LS-DWT is based on our previously proposed 2-D DWT architecture [6]. The DWT architecture in [6] uses a line buffer to hold the partially transformed coefficients [14], which avoids multiple accesses to the coefficients in the column direction, and uses a nonoverlapped stripe scan to eliminate the line buffer in the row direction. With this line-based architecture, each pixel is read only once, which is the theoretical lower bound. Based on the bit-width analysis in [2], the internal bit width used in this architecture is 14 bits and the output DWT coefficient is reduced to 10 bits. The simulation result shows that the
image quality is about
dB, which is not distinguishable
by human eyes.
To enable the level-switched scheduling, the inter-level line buffer for the column 1-D DWT and the LL-band buffer for the row 1-D DWT are used to buffer the partially transformed coefficients and the generated coefficients, respectively, for each level. For the inter-level line buffer, four lines are required for each level since the 9/7 filter is supported [14]. Therefore, the memory
requirement of the inter-level line buffer is 3 kB (
bits). To reduce the memory requirement for the
LL-band buffer, 8, 16, and 8 lines are used to buffer
, ,
and
coefficients, respectively. To obtain four basic coding
stripes (
) in four subbands for level-2 decom-
position, it should buffer 8 lines (
) for . The 8
lines in
are equal to two stripes, each of size 32 × 4. The two stripes match two basic coding stripes for the EBC and correspond to a computation state of 256 cycles.
The buffer size for the
( ) is derived from the size
used for the
( ). To fill buffered lines for up, it
can be achieved by decomposing the data in
buffer by four
times. Theoretically, this buffer size is minimal. However,
in the actual implementation, four additional lines are used due
to the latency of the LS-DWT. Therefore, the total buffer size is
5.2 kB (
bits).

Fig. 9. Word-level EBC codec architecture. It can encode or decode one DWT coefficient per cycle. The coefficient register bank (CRB) is designed to match the scanning data flow of the EBC. The parallel context formation (PCF) processes all bit-planes in parallel to generate contexts. The four-symbol arithmetic coder (FAC) processes all the contexts from a bit-plane in one cycle.
B. Code-Block Switched EBC Architecture
The EBC is the throughput bottleneck of a high-performance
JPEG 2000 codec. In [12], a word-level EBC encoder is used to
increase the throughput. However, the throughput depends on
the complexity of image source and the target image quality.
In this work, a word-level EBC codec, which guarantees that one coefficient is encoded or decoded per cycle, is developed. Fig. 9 shows the block diagram of the EBC codec, which processes a 10-bit DWT coefficient per cycle. The coefficient register bank (CRB) is designed to match the scanning data flow of the EBC. The parallel context formation (PCF) processes all bit-planes in parallel to generate contexts. The four-symbol arithmetic coder (FAC) is proposed to encode or decode all the contexts from a bit-plane in one cycle.
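As a rough plausibility check, the per-cycle guarantee can be related to the headline throughput. The sketch below takes the 42 MHz clock and the three EBC modules from the paper; the assumption that 1080p video is sampled as 4:2:2 (two samples per pixel) and the function names are hypothetical choices made here for illustration, not statements from the authors.

```python
CLOCK_HZ = 42_000_000      # operating frequency reported for the chip
EBC_MODULES = 3            # from the block diagram description
COEFFS_PER_EBC_CYCLE = 1   # guaranteed by the word-level EBC


def ebc_peak_rate():
    """Peak coefficients per second across all EBC modules."""
    return CLOCK_HZ * EBC_MODULES * COEFFS_PER_EBC_CYCLE


def video_sample_rate(width, height, fps, samples_per_pixel):
    """Samples per second for a video format."""
    return width * height * fps * samples_per_pixel


if __name__ == "__main__":
    # 2 samples per pixel is an assumed 4:2:2 chroma format
    need = video_sample_rate(1920, 1080, 30, 2)
    print(ebc_peak_rate(), need, ebc_peak_rate() >= need)
```

Under the 4:2:2 assumption the required rate is about 124.4 MSamples/s, just below the 126 M coefficients/s that three one-coefficient-per-cycle modules could sustain at 42 MHz.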
To match the level-switched scheduling of the DWT, a 2.5 kB probability state memory and a 0.34 kB inter-code-block line buffer are required for each EBC module to store the coding states of the unfinished code-blocks and the last row of the previous coding stripe for each code-block. The probability state memory buffers the coding states held in the probability state register bank (PSRB), which stores the coding states for the FAC, when the EBC switches to another code-block, and loads the states back into the PSRB before the unfinished code-block is continued. The coding states require 399 bits for a FAC in a bit-plane [10], for a total of 3990 bits for a code-block with 10 magnitude bit-planes.
Although there are seven code-blocks (
to ) should be
processed by the EBC, only five of them are switched to each
other at a time since
and are processed after and
. The probability state memory for and is re-used
for
and . Therefore, the probability state memory
of EBC requires 19950 bits (5 × 3990 bits ≈ 2.5 kB).
In the same way, the inter-code-block line buffer requires 0.34 kB
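The probability state memory figure follows from simple arithmetic, sketched here. The 399 bits per FAC bit-plane, ten bit-planes, and five concurrently switched code-blocks are taken from the text; the helper names are hypothetical, and the default of 1000 bytes per kB is an assumption, since the paper does not say which convention it uses.

```python
BITS_PER_FAC_BITPLANE = 399   # from the text, per [10]
BITPLANES = 10                # 10 magnitude bit-planes per code-block
SWITCHED_CODE_BLOCKS = 5      # code-blocks switched among at a time


def state_bits_per_code_block():
    """Coding-state bits that must be saved for one unfinished code-block."""
    return BITS_PER_FAC_BITPLANE * BITPLANES


def probability_state_memory_bits():
    """Total bits needed to hold the states of all switched code-blocks."""
    return state_bits_per_code_block() * SWITCHED_CODE_BLOCKS


def bits_to_kilobytes(bits, bytes_per_kb=1000):
    """Convert bits to kB; bytes_per_kb=1000 is an assumed convention."""
    return bits / 8 / bytes_per_kb


if __name__ == "__main__":
    total = probability_state_memory_bits()
    print(state_bits_per_code_block(), total, bits_to_kilobytes(total))
```

With these inputs the total is 19950 bits, or roughly 2.5 kB, which agrees with the size quoted for the probability state memory.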

Citations

Journal ArticleDOI
TL;DR: An algorithm and a hardware architecture of a new type EC codec engine with multiple modes are presented and the proposed four-tree pipelining scheme can reduce 83% latency and 67% buffer size between transform and entropy coding.
Abstract: In a typical portable multimedia system, external access, which is usually dominated by block-based video content, induces more than half of total system power. Embedded compression (EC) effectively reduces external access caused by video content by reducing the data size. In this paper, an algorithm and a hardware architecture of a new type EC codec engine with multiple modes are presented. Lossless mode, and lossy modes with rate control modes and quality control modes are all supported by single algorithm. The proposed four-tree pipelining scheme can reduce 83% latency and 67% buffer size between transform and entropy coding. The proposed EC codec engine can save 62%, 66%, and 77% external access at lossless mode, half-size mode, and quarter-size mode and can be used in various system power conditions. With TSMC 0.18 μm 1P6M CMOS logic process, the proposed EC codec engine can encode or decode CIF 30 frame per second video data and achieve power saving of more than 109 mW. The EC codec engine itself consumes only 2 mW power.

63 citations


Proceedings ArticleDOI
01 Jan 2005
TL;DR: An algorithm and a hardware architecture of a new type EC codec engine with multiple modes are presented and the proposed four-tree pipelining scheme can reduce 83% latency and 67% buffer size between transform and entropy coding.
Abstract: In a typical multi-chip handheld system for multi-media applications, external access, which is usually dominated by block-based video content, induces more than half of total system power. Embedded compression (EC) effectively reduces external access caused by video content by reducing the data size. In this paper, an algorithm and a hardware architecture of a new type EC codec engine with multiple modes are presented. Lossless mode, and lossy modes with rate control modes and quality control modes are all supported by single algorithm. The proposed four-tree pipelining scheme can reduce 83% latency and 67% buffer size between transform and entropy coding. The proposed EC codec engine can save 50%, 61%, and 77% external access at lossless mode, half-size mode, and quarter-size mode and can be used in various system power conditions. With Artisan 0.18 μm cell library, the proposed EC codec engine can encode or decode at VGA@30fps with area and power consumption of 293,605 μm² and 3.36 mW.

26 citations


Journal ArticleDOI
Ahmed Chefi, Adel Soudani, Gilles Sicard
TL;DR: The aim of this paper is to present and evaluate a hardware implementation for user-driven image compression scheme designed to respect the energy constraints of image transmission over wireless sensor networks (WSNs).
Abstract: Software implementation costs of most algorithms, designed for image compression in wireless sensor networks, do not justify their use to reduce the energy consumption and delay transmission of images. Even though the hardware solution looks to be very attractive for this problem, a specific care should be paid when designing a low power algorithm for image compression and transmission over these systems. The aim of this paper is to present and evaluate a hardware implementation for user-driven image compression scheme designed to respect the energy constraints of image transmission over wireless sensor networks (WSNs). The proposed encoder will be considered as a co-processor for tasks related with image compression and data packetization. In this paper, we discuss both of the hardware architecture and the features of this encoder circuit when prototyped on FPGA (field-programmable gate array) and ASIC (application-specific integrated circuit) circuits.

18 citations


Proceedings Article
01 Aug 2011
TL;DR: Hardware architecture of JPEG2000 encoder core, oriented for HD video broadcast and surveillance applications, implemented in VHDL and synthesised for FPGA devices, and ASIC 0.13 μm CMOS technology is presented.
Abstract: This article presents hardware architecture of JPEG2000 encoder core, oriented for HD video broadcast and surveillance applications. Thanks to developed efficient 2-D DWT engine that is capable of computing four coefficients per clock cycle, and adopted two EBCOT TIER-1 modules, with smart switching of the channels, the maximum compression speed of 180 Msamples/s at 100 MHz, in lossy mode is achieved. The architecture is implemented in VHDL and synthesised for FPGA devices, and ASIC 0.13 µm CMOS technology. Performance simulations, conducted on a set of natural images and video sequences, have revealed that the encoder is capable of processing 1080p 4∶4∶4 signal with a speed of 30 frames per second. Additionally, an excellent quality of reconstructed images has been observed, with respect to the reference, software encoder.

11 citations


Cites background from "124 MSamples/s Pixel-Pipelined Moti..."

  • ...All these implementations meet different design challenges that the researchers try to overcome....



Journal ArticleDOI
Yu-Wei Chang, Hung-Chi Fang, Chun-Chia Chen, Chung-Jr Lian, +1 more
Abstract: This paper presents a word-level decoding architecture of embedded block coding in JPEG 2000. This architecture decodes one coefficient per cycle based on the proposed word-level decoding algorithm. This algorithm eliminates state variable memories by decoding all bit-planes in parallel. The proposed column-switching scan order overcomes intra bit-plane dependency and inter bit-plane dependency to enable parallel processing. Implementation results show that the proposed architecture is capable of decoding 54 MSamples/s at 54 MHz, which can support HDTV 720p (1280 × 720, 4:2:2) decoding at 30 frames/s.

8 citations


Cites methods from "124 MSamples/s Pixel-Pipelined Moti..."

  • ...Therefore, we use registers to implement the coding states of AD....



References

Book
31 Dec 1992
TL;DR: This chapter discusses JPEG Syntax and Data Organization, the history of JPEG, and some of the aspects of the Human Visual Systems that make up JPEG.
Abstract: Foreword. Acknowledgments. Trademarks. Introduction. Image Concepts and Vocabulary. Aspects of the Human Visual Systems. The Discrete Cosine Transform (DCT). Image Compression Systems. JPEG Modes of Operation. JPEG Syntax and Data Organization. Entropy Coding Concepts. JPEG Binary Arithmetic Coding. JPEG Coding Models. JPEG Huffman Entropy Coding. Arithmetic Coding Statistical. More on Arithmetic Coding. Probability Estimation. Compression Performance. JPEG Enhancements. JPEG Applications and Vendors. Overview of CCITT, ISO, and IEC. History of JPEG. Other Image Compression Standards. Possible Future JPEG Directions. Appendix A. Appendix B. References. Index.

3,130 citations


"124 MSamples/s Pixel-Pipelined Moti..." refers background in this paper

  • ...1 shows the functional block diagram of the JPEG 2000 encoder....



Book
David Taubman, Michael W. Marcellin
30 Nov 2001
TL;DR: This work has specific applications for those involved in the development of software and hardware solutions for multimedia, internet, and medical imaging applications.
Abstract: This is nothing less than a totally essential reference for engineers and researchers in any field of work that involves the use of compressed imagery. Beginning with a thorough and up-to-date overview of the fundamentals of image compression, the authors move on to provide a complete description of the JPEG2000 standard. They then devote space to the implementation and exploitation of that standard. The final section describes other key image compression systems. This work has specific applications for those involved in the development of software and hardware solutions for multimedia, internet, and medical imaging applications.

2,938 citations


Journal ArticleDOI
David Taubman
TL;DR: A new image compression algorithm is proposed, based on independent embedded block coding with optimized truncation of the embedded bit-streams (EBCOT), capable of modeling the spatially varying visual masking phenomenon.
Abstract: A new image compression algorithm is proposed, based on independent embedded block coding with optimized truncation of the embedded bit-streams (EBCOT). The algorithm exhibits state-of-the-art compression performance while producing a bit-stream with a rich set of features, including resolution and SNR scalability together with a "random access" property. The algorithm has modest complexity and is suitable for applications involving remote browsing of large compressed images. The algorithm lends itself to explicit optimization with respect to MSE as well as more realistic psychovisual metrics, capable of modeling the spatially varying visual masking phenomenon.

1,907 citations


Journal ArticleDOI
TL;DR: Some of the most significant features of the standard are presented, such as region-of-interest coding, scalability, visual weighting, error resilience and file format aspects, and some comparative results are reported.
Abstract: One of the aims of the standardization committee has been the development of Part I, which could be used on a royalty- and fee-free basis. This is important for the standard to become widely accepted. The standardization process, which is coordinated by the JTC1/SC29/WG1 of the ISO/IEC has already produced the international standard (IS) for Part I. In this article the structure of Part I of the JPEG 2000 standard is presented and performance comparisons with established standards are reported. This article is intended to serve as a tutorial for the JPEG 2000 standard. The main application areas and their requirements are given. The architecture of the standard follows with the description of the tiling, multicomponent transformations, wavelet transforms, quantization and entropy coding. Some of the most significant features of the standard are presented, such as region-of-interest coding, scalability, visual weighting, error resilience and file format aspects. Finally, some comparative results are reported and the future parts of the standard are discussed.

1,614 citations


"124 MSamples/s Pixel-Pipelined Moti..." refers background in this paper

  • ...I. INTRODUCTION JPEG 2000 [15]–[17], [21], which is a new still image coding standard, is well known for its excellent coding performance and numerous features [19], such as region of interest (ROI), scalability, error resilience, etc....



Proceedings ArticleDOI
David Taubman
24 Oct 1999
TL;DR: A new image compression algorithm is proposed, based on independent embedded block coding with optimized truncation of the embedded bit-streams (EBCOT), capable of modeling the spatially varying visual masking phenomenon.
Abstract: A new image compression algorithm is proposed, based on independent embedded block coding with optimized truncation of the embedded bit-streams (EBCOT). The algorithm exhibits state-of-the-art compression performance while producing a bit-stream with a rich feature set, including resolution and SNR scalability together with a random access property. The algorithm has modest complexity and is extremely well suited to applications involving remote browsing of large compressed images. The algorithm lends itself to explicit optimization with respect to MSE as well as more realistic psychovisual metrics, capable of modeling the spatially varying visual masking phenomenon.

1,450 citations


Performance
Metrics
No. of citations received by the Paper in previous years
Year  Citations
2020  1
2014  2
2011  2
2009  2
2008  1
2007  2