
Analysis and architecture design of variable block-size motion estimation for H.264/AVC



578 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 53, NO. 2, FEBRUARY 2006
Analysis and Architecture Design of Variable
Block-Size Motion Estimation for H.264/AVC
Ching-Yeh Chen, Shao-Yi Chien, Yu-Wen Huang, Tung-Chien Chen, Tu-Chih Wang, and
Liang-Gee Chen, Fellow, IEEE
Abstract—Variable block-size motion estimation (VBSME) has become an important video coding technique, but it increases the difficulty of hardware design. In this paper, we use inter-/intra-level classification and various data flows to analyze the impact of supporting VBSME in different hardware architectures. Furthermore, we propose two hardware architectures that can support traditional fixed block-size motion estimation as well as VBSME with less chip area overhead compared to previous approaches. By broadcasting reference pixel rows and propagating partial sums of absolute differences (SADs), the first design has fewer reference pixel registers and a shorter critical path. The second design utilizes a two-dimensional distortion array and one adder tree with a reference buffer that can maximize the data reuse between successive searching candidates. The first design is suitable for low resolution or a small search range, and the second design has the advantages of supporting a high degree of parallelism and VBSME. Finally, we propose an eight-parallel SAD tree with a shared reference buffer for H.264/AVC integer motion estimation (IME). Its processing ability is eight times that of the single SAD tree, but the reference buffer size is only doubled. Moreover, the most critical issue of H.264 IME, which is huge memory bandwidth, is overcome. We are able to save 99.9% of the off-chip memory bandwidth and 99.22% of the on-chip memory bandwidth. We demonstrate a 720-p, 30-fps solution at 108 MHz with a 330.2k gate count and 208k bits of on-chip memory.
Index Terms—Block matching, H.264/AVC, motion estimation (ME), variable block size, very large scale integration (VLSI) architecture.
Manuscript received November 1, 2004; revised April 18, 2005 and July 7, 2005. This paper was recommended by Associate Editor Y.-K. Chen.
C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, and L.-G. Chen are with the DSP/IC Design Laboratory, Graduate Institute of Electronics Engineering and Department of Electrical Engineering II, National Taiwan University, Taipei 10617, Taiwan, R.O.C. (e-mail: cychen@cc.ee.ntu.edu.tw; shoayi@cc.ee.ntu.edu.tw; yuwen@cc.ee.ntu.edu.tw; djchen@cc.ee.ntu.edu.tw; lgchen@cc.ee.ntu.edu.tw).
T.-C. Wang is with the DSP/IC Design Laboratory, Graduate Institute of Electronics Engineering and Department of Electrical Engineering II, National Taiwan University, Taipei 10617, Taiwan, R.O.C., and also with Chin Fong Machine Industrial, 50445 Chang Hua, Taiwan, R.O.C.
Digital Object Identifier 10.1109/TCSI.2005.858488

I. INTRODUCTION
For video coding systems, motion estimation (ME) can remove most of the temporal redundancy, so a high compression ratio can be achieved. Among various ME algorithms, a full-search block matching algorithm (FSBMA) is usually adopted because of its good quality and regular computation. In FSBMA, the current frame is partitioned into many small macroblocks (MBs) of size $N \times N$. For each MB in the current frame (current MB), the reference block that is most similar to the current MB is sought in the search range of size $[-P_H, P_H)$ and $[-P_V, P_V)$ in the reference frame. The most commonly used similarity criterion is the sum of absolute differences (SAD)

$$\mathrm{SAD}(i,j) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} \mathrm{Distortion}(i,j,k,l) \qquad (1)$$

$$\mathrm{Distortion}(i,j,k,l) = \left| \mathrm{cur}(k,l) - \mathrm{ref}(k+i,\, l+j) \right| \qquad (2)$$

where $\mathrm{cur}(k,l)$ and $\mathrm{ref}(k+i, l+j)$ are pixel values in the current MB (current pixel) and reference block (reference pixel), respectively, $(i,j)$ is one searching candidate in the search range, $\mathrm{Distortion}(i,j,k,l)$ is the difference between the current pixel and the reference pixel, and $\mathrm{SAD}(i,j)$ is the total distortion of this searching candidate. The row (column) SAD is the summation of $N$ distortions in a row (column). After all searching candidates are examined, the searching candidate that has the smallest SAD is selected as the motion vector of the current MB. Although FSBMA provides the best quality among various ME algorithms, it also requires the most computation. In general, ME accounts for 50% to 90% of the computational complexity of a typical video coding system. Hence, a hardware accelerator of ME is required.
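As a software reference point for (1) and (2), the following is a minimal sketch of FSBMA for a single current MB. It is not the paper's hardware data flow. The function names, the list-of-lists pixel layout, and the convention that `i` is the vertical and `j` the horizontal displacement are assumptions made for illustration, as is the choice to pass a pre-cropped reference window whose top-left pixel corresponds to candidate `(-p_v, -p_h)`.

```python
def sad(cur_mb, ref, top, left):
    """Eq. (1)-(2): SAD between cur_mb and the N x N block of ref at (top, left)."""
    n = len(cur_mb)
    return sum(abs(cur_mb[k][l] - ref[top + k][left + l])
               for k in range(n) for l in range(n))


def full_search(cur_mb, ref_window, p_h, p_v):
    """Examine every candidate (i, j) with -p_v <= i < p_v and -p_h <= j < p_h.

    ref_window is assumed to be cropped so that candidate (i, j) starts at
    row i + p_v and column j + p_h of the window.
    Returns (smallest SAD, motion vector of that candidate).
    """
    best_cost, best_mv = None, None
    for i in range(-p_v, p_v):
        for j in range(-p_h, p_h):
            cost = sad(cur_mb, ref_window, i + p_v, j + p_h)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (i, j)
    return best_cost, best_mv
```

Each candidate costs $N^2$ absolute differences and there are $2P_H \times 2P_V$ candidates per MB, which is the regular but heavy workload that the hardware architectures in Section II parallelize.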
Variable block-size motion estimation (VBSME) is a new coding technique and provides more accurate predictions compared to traditional fixed block-size motion estimation (FBSME). With FBSME, if an MB consists of two objects with different motion directions, the coding performance of this MB is worse. On the other hand, for the same condition, the MB can be divided into smaller blocks in order to fit the different motion directions with VBSME. Hence, the coding performance is improved. VBSME has been adopted in the latest video coding standards, including H.263 [1], MPEG-4 [2], WMV9.0 [3], and H.264/AVC [4]. For instance, in H.264/AVC, an MB with a variable block size can be divided into seven kinds of blocks including 4 × 4, 4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8, and 16 × 16. Although VBSME can achieve a higher compression ratio, it not only requires huge computation complexity but also increases the difficulty of hardware implementation for ME.
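One way to obtain SADs for all seven block sizes without repeating the pixel-level work is the reuse method that Section II-I describes: compute the SADs of the smallest (4 × 4) blocks once and add them up for the larger blocks. The sketch below illustrates that idea for one searching candidate of a 16 × 16 MB. The function names and the dictionary layout are hypothetical, and the width × height reading of the partition names (4 × 8 is 4 wide and 8 tall) is an assumption based on H.264/AVC usage.

```python
def sad_4x4_grid(cur_mb, ref_block):
    """SADs of the sixteen 4x4 blocks of a 16x16 MB, returned as a 4x4 grid."""
    grid = [[0] * 4 for _ in range(4)]
    for k in range(16):
        for l in range(16):
            grid[k // 4][l // 4] += abs(cur_mb[k][l] - ref_block[k][l])
    return grid


def merge_sads(g):
    """Derive the SADs of all larger blocks from the 4x4 SADs by addition only."""
    s = {"4x4": [g[r][c] for r in range(4) for c in range(4)]}
    # Names are width x height: a 4x8 block stacks two vertically adjacent 4x4 blocks.
    s["4x8"] = [g[r][c] + g[r + 1][c] for r in (0, 2) for c in range(4)]
    s["8x4"] = [g[r][c] + g[r][c + 1] for r in range(4) for c in (0, 2)]
    s["8x8"] = [g[r][c] + g[r][c + 1] + g[r + 1][c] + g[r + 1][c + 1]
                for r in (0, 2) for c in (0, 2)]
    q = s["8x8"]  # order: top-left, top-right, bottom-left, bottom-right
    s["8x16"] = [q[0] + q[2], q[1] + q[3]]  # left half, right half
    s["16x8"] = [q[0] + q[1], q[2] + q[3]]  # top half, bottom half
    s["16x16"] = [sum(q)]
    return s
```

Under these assumptions a single candidate yields 16 + 8 + 8 + 4 + 2 + 2 + 1 = 41 block SADs, and everything above the 4 × 4 level costs only a handful of additions.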
1057-7122/$20.00 © 2006 IEEE

Fig. 1. Hardware architectures of (a) 1DInterYSW, (b) 2DInterYH, and (c) 2DInterLC, where $N = 4$, $P_H = 2$, and $P_V = 2$.

Traditional ME hardware architectures are designed for FBSME, and they can be classified into two categories. One is an inter-level architecture, where each processing element (PE) is responsible for one SAD of a specific searching candidate, as shown in (1), and the other is an intra-level architecture, where each PE is responsible for the distortion of a specific current pixel in the current MB for all searching candidates, as shown in (2). For example, Yang et al. proposed a one-dimensional (1-D) inter-level semisystolic architecture [5], and Komarek and Pirsch proposed a two-dimensional (2-D) intra-level systolic architecture, AB2, in [6].
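The two categories can be read as two orderings of the same loop nest. The sketch below is only a functional illustration under assumed names: the innermost loop marked "space" stands for work that the PEs would do in parallel, and the outer loops marked "time" stand for successive cycles. Candidates are given as non-negative window offsets for simplicity, and no real architecture's timing or routing is modeled.

```python
def inter_level(cur, ref, n, cands):
    """One PE per searching candidate; each PE accumulates its own SAD as in (1)."""
    acc = {c: 0 for c in cands}          # one accumulator per PE
    for k in range(n):                   # time: one current pixel per cycle
        for l in range(n):
            for (i, j) in cands:         # space: all PEs work in parallel
                acc[(i, j)] += abs(cur[k][l] - ref[k + i][l + j])
    return acc


def intra_level(cur, ref, n, cands):
    """One PE per current pixel; each PE produces one distortion as in (2)."""
    out = {}
    for (i, j) in cands:                 # time: one candidate per cycle
        dist = [abs(cur[k][l] - ref[k + i][l + j])   # space: N*N PEs
                for k in range(n) for l in range(n)]
        out[(i, j)] = sum(dist)          # e.g., summed by an adder tree
    return out
```

Both orderings return the same SAD for every candidate. What differs is where partial SADs live: inside each PE in the inter-level case, and in a shared accumulation structure in the intra-level case.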
In this paper, we not only use inter-/intra-level classification and various data flows to analyze the impact of supporting VBSME in different hardware architectures but also propose two hardware architectures, Propagate Partial SAD and SAD Tree, that can support VBSME as well as FBSME with less chip area overhead compared to previous techniques. After analyzing the impact of supporting VBSME in different hardware architectures, we discuss the hardware design challenges of integer motion estimation in H.264/AVC for D1 size as an example. Because of multiple reference frames and VBSME, integer motion estimation in H.264/AVC not only requires large computation complexity but also needs huge memory bandwidth. Based on the previous analysis, we utilize SAD Tree to design a hardware architecture and reduce the required memory bandwidth for H.264/AVC integer motion estimation.
The remainder of this paper is organized as follows. In Section II, six previous hardware architectures of ME are surveyed first, and we propose two hardware architectures for FBSME. Next, we analyze the impact of supporting VBSME in different hardware architectures and directly extend the six previous works and our two architectures to support VBSME. In Section III, we give an example to provide the quantified comparisons of the eight hardware architectures for FBSME and VBSME, respectively. After that, based on our analysis results, we develop a hardware architecture for H.264/AVC integer motion estimation as an example in Section IV. Finally, a conclusion is given in Section V.
II. IMPACT OF SUPPORTING VBSME IN DIFFERENT HARDWARE ARCHITECTURES
In this section, six representative previous works of ME hardware architectures for FBSME are introduced first. They are the works of Yang et al. [5], Yeo and Hu [7], Lai and Chen [8], Komarek and Pirsch [6], Vos and Stegherr [9], and Hsieh and Lin [10]. These six architectures are significant works, and many hardware architectures have been proposed based on them. For example, reference [11] is the extension of [5]. Reference [12] is proposed based on [9], and reference [13] combined [10] with the multilevel successive elimination algorithm [14], [15]. Reference [16] is the extension of [6]. Besides pure inter-/intra-level architectures, there are other kinds of architectures such as AS2 in [6] and a tree-based architecture in [17], which are hybrids of inter- and intra-level architectures. For the sake of simplicity, we only discuss the pure inter-/intra-level architectures, and the others can be easily extended based on our analysis.
Moreover, we also propose two intra-level hardware architectures and analyze the impact of supporting VBSME in these hardware architectures based on various data flows in this section. The direct extensions of the eight architectures are also proposed. In the following discussion, we assume that the block size is $N \times N$ and that the search range is $[-P_H, P_H)$ and $[-P_V, P_V)$ in the horizontal and vertical directions.
A. Work of Yang et al.
Yang et al. implemented the first VLSI motion estimator in the world, as shown in Fig. 1(a), which is a 1-D inter-level hardware architecture (1DInterYSW). The number of PEs is equal to the number of searching candidates in the horizontal direction, $2P_H$. Reference pixels are broadcast to all PEs. By selection signals, the corresponding reference pixel is selected and inputted into each PE. Current pixels are propagated with propagation registers, and the partial SAD is stored in each PE. In each cycle, each PE computes the distortion and accumulates the SAD of a searching candidate. In this architecture, the most important concept is data broadcasting. With the broadcasting technique, the memory bit width, which is defined as the number of bits of reference data required in one cycle, is reduced significantly, although some global routing is required.

Fig. 2. Hardware architectures of (a) 2DIntraVS, (b) 2DIntraKP, and (c) 2DIntraHL, where $N = 4$, $P_H = 2$, and $P_V = 2$.
B. Work of Yeo and Hu
Fig. 1(b) shows a 2-D inter-level hardware architecture which
is proposed by Yeo and Hu (2DInterYH). 2DInterYH consists
of
PEs and is similar to 1DInterYSW. Reference
pixels are broadcasted into PEs, and current pixels are propa-
gated with propagation registers. The partial SADs are stored
and accumulated in PEs, respectively. Because of broadcasting
reference pixels in both directions, the number of PEs has to
match the MB size. Hence, the search range is partitioned into
regions, and each region is computed by
a set of
PEs. The characteristic of 2DInterYH is broad-
casting in two directions at the same time, which can increase
the data reuse.
C. Work of Lai and Chen
Lai and Chen also proposed another 1-D PE array that im-
plemented a 2-D inter-level architecture with two data-inter-
lacing reference arrays (2DInterLC). The hardware architec-
ture is shown in Fig. 1(c) and is similar to 2DInterYH except
for two aspects. Reference pixels are propagated with propa-
gation registers, and current pixels are broadcast into PEs. The
partial SADs are still stored and accumulated in PEs. Besides,
2DInterLC has to load reference pixels into propagation regis-
ters before computing SADs. The latency of loading reference
pixels can be reduced by partitioning the search range in 2DIn-
terLC. For example, the search range can be partitioned into
parts for a shorter latency.
D. Work of Vos and Stegherr
A 2-D intra-level architecture is proposed by Vos and Stegherr (2DIntraVS), as shown in Fig. 2(a), where the number of PEs is equal to the MB size. Each PE corresponds to a current pixel, and current pixels are stored in PEs, respectively. The important concept of 2DIntraVS is the scanning order in searching candidates, which is known as the snake scan. In order to realize this, a large number of propagation registers are used to store reference pixels, and the data in propagation registers can be shifted in upward, downward, and right directions. These propagation registers and the long latency for loading reference pixels are the tradeoffs for the reduction of memory usage. The computation flow is as follows. First, the distortion is computed in each PE, and $N$ partial-row SADs are propagated and accumulated in the horizontal direction. Second, an adder tree is used to accumulate the $N$ row SADs into the SAD. The accumulations of row SADs and SAD are done in one cycle. Hence, no partial SAD is required to be stored.

Fig. 3. (a) Concept, (b) hardware architecture, and (c) detailed architecture of a PE array with a 1-D adder tree using Propagate Partial SAD architecture, where $N = 4$.
E. Work of Komarek and Pirsch
Komarek and Pirsch contributed a detailed systolic mapping procedure by the dependence graph (DG). By using different DGs, including different scheduling and projections, different systolic hardware architectures can be derived. AB2 (2DIntraKP) is a 2-D intra-level architecture, as shown in Fig. 2(b). Current pixels are stored in corresponding PEs. Reference pixels are propagated PE by PE in the horizontal direction. The $N$ partial-column SADs are propagated and accumulated in the vertical direction first. After the vertical propagation, these $N$ column SADs are propagated in the horizontal direction. In each PE, the distortion of a current pixel in the current MB is computed and added to the partial-column SAD, which is propagated through the PEs from top to bottom in the vertical direction. In the horizontal propagation, these $N$ column SADs are accumulated one by one by adders and registers.
F. Work of Hsieh and Lin
Hsieh and Lin proposed another 2-D intra-level hardware architecture with a search range buffer (2DIntraHL), as shown in Fig. 2(c). 2DIntraHL consists of $N$ PE arrays in the vertical direction, and each PE array is composed of $N$ PEs in a row. In 2DIntraHL, reference pixels are propagated with propagation registers one by one, which provides the advantages of serial data input and increased data reuse. Current pixels are still stored in PEs. The $N$ partial-column SADs are propagated in the vertical direction from bottom to top. In each computing cycle, each PE array generates $N$ distortions of a searching candidate and accumulates these distortions with the $N$ partial-column SADs in the vertical propagation. After accumulation in the vertical direction, the $N$ column SADs are accumulated in the top adder tree in one cycle. The longer latency for loading reference pixels and the large propagation registers are the penalties for the reduction of memory bandwidth and memory bit width.
G. Proposed Propagate Partial SAD
We propose a 2-D intra-level architecture called the Propagate Partial SAD [18]. Fig. 3(a) and (b) shows the concept and hardware architecture of Propagate Partial SAD, respectively. The architecture is composed of $N$ PE arrays with a 1-D adder tree in the vertical direction. Current pixels are stored in each PE, and two sets of $N$ continuous reference pixels in a row are broadcast to the $N$ PE arrays at the same time. In each PE array with a 1-D adder tree, $N$ distortions are computed and summed by a 1-D adder tree to generate a one-row SAD, as shown in Fig. 3(c). The row SADs are accumulated and propagated with propagation registers in the vertical direction, as shown in the right-hand side of Fig. 3(b).
Fig. 4. Detailed data flow of the proposed Propagate Partial SAD architecture, where $N = 4$ and $P_H = P_V = 2$.

Fig. 5. (a) Concept, (b) hardware architecture, and (c) scan order and memory access of the proposed SAD Tree architecture, where $N = 4$.

The detailed data flow of Propagate Partial SAD is shown in Fig. 4. The reference data of searching candidates in the even and odd columns are inputted by Ref. Pixels 0 and Ref. Pixels 1, respectively. After the initial cycles, the SAD of the first searching candidate in the zeroth column is generated, and the SADs of the other searching candidates are sequentially generated in the following cycles. When computing the last searching candidates in each column, the reference data of searching candidates in the next column begin to be inputted through the other reference input. Then, the hardware utilization is 100% except for the initial latency. In Propagate Partial SAD, broadcasting reference pixel rows and propagating partial-row SADs in the vertical direction provides the advantages of fewer reference pixel registers and a shorter critical path.
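The propagation of partial-row SADs can be illustrated with a small functional model. This is a simplified sketch, not the authors' design: it covers one column of searching candidates with a single reference input, so the interleaving of Ref. Pixels 0 and Ref. Pixels 1 between columns is left out, and the function name and list-based data layout are assumptions.

```python
def propagate_partial_sad(cur, ref_col):
    """Functional cycle model for one column of searching candidates.

    cur:     N x N current block, one row per PE array (held in the PEs).
    ref_col: reference rows of N pixels each for a fixed horizontal offset;
             row t is broadcast to every PE array in cycle t.
    Returns the SADs for vertical offsets 0 .. len(ref_col) - N, in order.
    """
    n = len(cur)
    regs = [0] * n                # propagation registers, one per PE array
    sads = []
    for t, ref_row in enumerate(ref_col):
        # 1-D adder tree in each PE array: N distortions -> one row SAD
        row_sad = [sum(abs(c - r) for c, r in zip(cur[k], ref_row))
                   for k in range(n)]
        # each partial SAD moves one PE array down and adds that array's row SAD
        regs = [row_sad[0]] + [regs[k - 1] + row_sad[k] for k in range(1, n)]
        if t >= n - 1:            # pipeline is full: one finished SAD per cycle
            sads.append(regs[n - 1])
    return sads
```

In this model the only state besides the stored current pixels is one partial SAD per PE array, and each cycle adds a single row SAD to a registered value. No reference pixels are held between cycles, which is in line with the fewer reference pixel registers claimed above.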
H. Proposed SAD Tree
Fig. 5(a) shows the concept of the proposed SAD Tree architecture. The proposed SAD Tree is a 2-D intra-level architecture and consists of a 2-D PE array and one 2-D adder tree with propagation registers, as shown in Fig. 5(b) and (c). Current pixels are stored in each PE, and reference pixels are stored in propagation registers for data reuse. In each cycle, $N \times N$ current and reference pixels are inputted to PEs. Simultaneously, $N$ continuous reference pixels in a row are inputted into propagation registers to update reference pixels. In propagation registers, reference pixels are propagated in the vertical direction row by row. In SAD Tree architecture, all distortions of a searching candidate are generated in the same cycle, and by an adder tree, $N \times N$ distortions are accumulated to derive the SAD in one cycle.
In order to provide high utilization and data reuse, the snake scan is adopted and reconfigurable data path propagation registers are developed in the proposed SAD Tree, as shown in Fig. 5(c), which consists of five basic steps from A to E. The first step, A, fetches $N$ pixels in a row, and the shift direction of the propagation registers is downward. When calculating the last candidates in a column, one extra reference pixel is required to be inputted, that is, step B. When finishing the computation of one column, the reference pixels in the propagation registers are shifted left in step C. Because the reference data have already been stored in the propagation registers, the SAD can be directly calculated. The next two steps, D and E, are the same as steps A and B except that the shift direction is upward. After finishing the computation of one column in the search range, we execute step C and then go back to step A. This procedure iterates until all searching candidates in the search range have been calculated. The detailed data flow is shown in Fig. 6. With the snake scan and reconfigurable propagation registers, the data reuse between two successive searching candidates can be maximized, and the hardware utilization approaches 100%.
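The benefit of the snake scan for data reuse can be checked with a short sketch. It only measures how much two successive candidate blocks overlap. It does not model the propagation registers or steps A to E, and the function names and the (column, row) candidate indexing are assumptions.

```python
def snake_scan(num_cols, num_rows):
    """Candidate order as (col, row): down one column, over by one, up the next."""
    order = []
    for col in range(num_cols):
        rows = range(num_rows) if col % 2 == 0 else range(num_rows - 1, -1, -1)
        order.extend((col, row) for row in rows)
    return order


def new_pixels_per_step(order, n):
    """Reference pixels of each candidate not covered by the previous N x N block."""
    counts = []
    for (c0, r0), (c1, r1) in zip(order, order[1:]):
        prev = {(r0 + k, c0 + l) for k in range(n) for l in range(n)}
        nxt = {(r1 + k, c1 + l) for k in range(n) for l in range(n)}
        counts.append(len(nxt - prev))
    return counts
```

Because every step of the snake scan moves the candidate by exactly one pixel, each new block shares all but one row or one column with its predecessor. A raster scan, by contrast, jumps back to the top of the next column at each column boundary and loses most of that overlap.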
I. Impact of Variable Block-Size Motion Estimation in
Hardware Architectures
There are many methods to support VBSME in hardware architectures. For example, we can increase the number of PEs or the operating frequency to do ME for different block sizes, respectively. One of them is to reuse the SADs of the smallest blocks, which are the blocks partitioned with the smallest block size, to derive the SADs of larger blocks. By this method, the overhead of supporting VBSME is only a slight increase of gate count, and the other factors, such as frequency, hardware utilization, memory usage, and so on, are the same as those of FBSME. When this method is adopted, the circuit for the SAD calculation is the only difference between FBSME and VBSME for hardware designs. Hence, the impact of supporting VBSME in hardware architectures is dependent on the different data flows of partial SADs. In inter-level architectures, the data flow of partial SADs is simple, where the partial SADs are stored in each PE. In intra-level architectures, there are two