Analysis and architecture design of variable block-size motion estimation for H.264/AVC

doi:10.1109/TCSI.2005.858488

578 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 53, NO. 2, FEBRUARY 2006

Analysis and Architecture Design of Variable Block-Size Motion Estimation for H.264/AVC

Ching-Yeh Chen, Shao-Yi Chien, Yu-Wen Huang, Tung-Chien Chen, Tu-Chih Wang, and Liang-Gee Chen, Fellow, IEEE

Abstract—Variable block-size motion estimation (VBSME) has

become an important video coding technique, but it increases the difficulty of hardware design. In this paper, we use inter-/intra-level classification and various data flows to analyze the impact of supporting VBSME in different hardware architectures. Furthermore, we propose two hardware architectures that can support traditional fixed block-size motion estimation as well as VBSME with less chip area overhead than previous approaches. By broadcasting reference pixel rows and propagating partial sums of absolute differences (SADs), the first design has fewer reference pixel registers and a shorter critical path. The second design utilizes a two-dimensional distortion array and one adder tree with a reference buffer that maximizes the data reuse between successive searching candidates. The first design is suitable for low resolution or a small search range, and the second design has the advantages of supporting a high degree of parallelism and VBSME. Finally, we propose an eight-parallel SAD tree with a shared reference buffer for H.264/AVC integer motion estimation (IME). Its processing ability is eight times that of a single SAD tree, but the reference buffer size is only doubled. Moreover, the most critical issue of H.264 IME, its huge memory bandwidth, is overcome. We are able to save 99.9% of the off-chip memory bandwidth and 99.22% of the on-chip memory bandwidth. We demonstrate a 720p, 30-fps solution at 108 MHz with a 330.2k gate count and 208 kbits of on-chip memory.

Index Terms—Block matching, H.264/AVC, motion estimation

(ME), variable block size, very large scale integration (VLSI) architecture.

I. INTRODUCTION

Manuscript received November 1, 2004; revised April 18, 2005 and July 7, 2005. This paper was recommended by Associate Editor Y.-K. Chen.

C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, and L.-G. Chen are with the DSP/IC Design Laboratory, Graduate Institute of Electronics Engineering and Department of Electrical Engineering II, National Taiwan University, Taipei 10617, Taiwan, R.O.C. (e-mail: cychen@cc.ee.ntu.edu.tw; shoayi@cc.ee.ntu.edu.tw; yuwen@cc.ee.ntu.edu.tw; djchen@cc.ee.ntu.edu.tw; lgchen@cc.ee.ntu.edu.tw).

T.-C. Wang is with the DSP/IC Design Laboratory, Graduate Institute of Electronics Engineering and Department of Electrical Engineering II, National Taiwan University, Taipei 10617, Taiwan, R.O.C., and also with Chin Fong Machine Industrial, 50445 Chang Hua, Taiwan, R.O.C.

Digital Object Identifier 10.1109/TCSI.2005.858488

FOR VIDEO coding systems, motion estimation (ME) can remove most of the temporal redundancy, so a high compression ratio can be achieved. Among various ME algorithms, a full-search block matching algorithm (FSBMA) is usually adopted because of its good quality and regular computation. In FSBMA, the current frame is partitioned into many small macroblocks (MBs) of size $N \times N$. For each MB in the current frame (current MB), the reference block that is most similar to the current MB is sought in the search range of size $[-P_H, P_H)$ and $[-P_V, P_V)$ in the reference frame. The most commonly used similarity criterion is the sum of absolute differences (SAD)

$$\mathrm{SAD}(k, l) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \mathrm{Distortion}(i, j, k, l) \tag{1}$$

$$\mathrm{Distortion}(i, j, k, l) = \left| \mathrm{cur}(i, j) - \mathrm{ref}(i + k, j + l) \right| \tag{2}$$

where $\mathrm{cur}(i, j)$ and $\mathrm{ref}(i + k, j + l)$ are pixel values in the current MB (current pixel) and reference block (reference pixel), respectively, $(k, l)$ is one searching candidate in the search range, $\mathrm{Distortion}(i, j, k, l)$ is the difference between the current pixel and the reference pixel, and $\mathrm{SAD}(k, l)$ is the total distortion of this searching candidate. The row (column) SAD is the summation of $N$ distortions in a row (column). After all searching candidates are examined, the searching candidate that has the smallest SAD is selected as the motion vector of the current MB. Although FSBMA provides the best quality among various ME algorithms, it also requires the most computation. In general, ME accounts for 50% to 90% of the computational complexity of a typical video coding system. Hence, a hardware accelerator for ME is required.
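As an illustration of the full-search procedure behind (1) and (2), the following is a minimal software sketch, not the authors' hardware. The function names `sad` and `full_search` are hypothetical, the first array index is assumed to be the row, and candidates are re-indexed to nonnegative offsets into a search window instead of the signed range used in the text.

```python
def sad(cur, ref, k, l, n):
    """SAD of candidate (k, l): sum of |cur(i, j) - ref(i + k, j + l)| over the n x n block.

    Assumption: ref is the search window, indexed from its top-left corner,
    so k and l are nonnegative offsets rather than signed displacements.
    """
    return sum(abs(cur[i][j] - ref[i + k][j + l])
               for i in range(n) for j in range(n))


def full_search(cur, ref, n, num_k, num_l):
    """Examine every candidate and return (smallest SAD, (k, l)).

    Ties are resolved in favor of the candidate examined first.
    """
    best = None
    for k in range(num_k):
        for l in range(num_l):
            s = sad(cur, ref, k, l, n)
            if best is None or s < best[0]:
                best = (s, (k, l))
    return best
```

Each candidate costs $N^2$ absolute differences, so the total work grows with the number of candidates times $N^2$, which is why the regular but heavy computation is usually moved into hardware.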

Variable block-size motion estimation (VBSME) is a new coding technique that provides more accurate predictions than traditional fixed block-size motion estimation (FBSME). With FBSME, if an MB consists of two objects with different motion directions, the coding performance of this MB is degraded. With VBSME, under the same condition, the MB can be divided into smaller blocks in order to fit the different motion directions, so the coding performance is improved. VBSME has been adopted in the latest video coding standards, including H.263 [1], MPEG-4 [2], WMV9.0 [3], and H.264/AVC [4]. For instance, in H.264/AVC, an MB with a variable block size can be divided into seven kinds of blocks: 4 × 4, 4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8, and 16 × 16. Although VBSME can achieve a higher compression ratio, it not only requires huge computation complexity but also increases the difficulty of hardware implementation for ME.
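To make the seven H.264/AVC block sizes concrete, the sketch below derives the SADs of all larger blocks of one 16 × 16 MB from its sixteen 4 × 4 SADs for a single searching candidate. It is only an illustration: the names `SIZES`, `merge`, and `vbsme_sads` are hypothetical, and the labels are assumed to mean width × height.

```python
# Block sizes expressed in units of 4x4 blocks: label -> (width, height).
# Assumption: "WxH" labels mean width x height.
SIZES = {
    "4x4": (1, 1), "4x8": (1, 2), "8x4": (2, 1), "8x8": (2, 2),
    "8x16": (2, 4), "16x8": (4, 2), "16x16": (4, 4),
}


def merge(sad4x4, bw, bh):
    """Sum 4x4 SADs over tiles that are bw units wide and bh units tall."""
    rows, cols = len(sad4x4), len(sad4x4[0])
    return [sum(sad4x4[r + dr][c + dc] for dr in range(bh) for dc in range(bw))
            for r in range(0, rows, bh) for c in range(0, cols, bw)]


def vbsme_sads(sad4x4):
    """Return {label: list of SADs} for all seven block sizes of one MB."""
    return {name: merge(sad4x4, bw, bh) for name, (bw, bh) in SIZES.items()}
```

For one candidate this yields 16 + 8 + 8 + 4 + 2 + 2 + 1 = 41 SADs, all obtained by additions on top of the sixteen 4 × 4 SADs.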

Traditional ME hardware architectures are designed for FBSME, and they can be classified into two categories. One is an inter-level architecture, where each processing element (PE) is responsible for one SAD of a specific searching candidate, as shown in (1), and the other is an intra-level architecture, where each PE is responsible for the distortion of a specific current pixel in the current MB for all searching candidates, as shown in (2). For example, Yang et al. proposed a one-dimensional (1-D) inter-level semisystolic architecture [5], and Komarek and Pirsch proposed a two-dimensional (2-D) intra-level systolic architecture, AB2, in [6].

Fig. 1. Hardware architectures of (a) 1DInterYSW, (b) 2DInterYH, and (c) 2DInterLC, where $N = 4$, $P_H = 2$, and $P_V = 2$.

In this paper, we not only use inter-/intra-level classification and various data flows to analyze the impact of supporting VBSME in different hardware architectures but also propose two hardware architectures, Propagate Partial SAD and SAD Tree, that can support VBSME as well as FBSME with less chip area overhead compared to previous techniques. After analyzing the impact of supporting VBSME in different hardware architectures, we discuss the hardware design challenges of integer motion estimation in H.264/AVC for D1 size as an example. Because of multiple reference frames and VBSME, integer motion estimation in H.264/AVC not only requires large computation complexity but also needs huge memory bandwidth. Based on the previous analysis, we utilize SAD Tree to design a hardware architecture and reduce the required memory bandwidth for H.264/AVC integer motion estimation.

The remainder of this paper is organized as follows. In Section II, six previous hardware architectures of ME are surveyed first, and we propose two hardware architectures for FBSME. Next, we analyze the impact of supporting VBSME in different hardware architectures and directly extend the six previous works and our two architectures to support VBSME. In Section III, we give an example to provide quantified comparisons of the eight hardware architectures for FBSME and VBSME, respectively. After that, based on our analysis results, we develop a hardware architecture for H.264/AVC integer motion estimation as an example in Section IV. Finally, a conclusion is given in Section V.

II. IMPACT OF SUPPORTING VBSME IN DIFFERENT HARDWARE ARCHITECTURES

In this section, six representative previous ME hardware architectures for FBSME are introduced first. They are the works of Yang et al. [5], Yeo and Hu [7], Lai and Chen [8], Komarek and Pirsch [6], Vos and Stegherr [9], and Hsieh and Lin [10]. These six architectures are significant works, and many hardware architectures have been proposed based on them. For example, [11] is an extension of [5], [12] is based on [9], [13] combines [10] with the multilevel successive elimination algorithm [14], [15], and [16] is an extension of [6]. Besides pure inter-/intra-level architectures, there are other kinds of architectures, such as AS2 in [6] and a tree-based architecture in [17], which are hybrids of inter- and intra-level architectures. For the sake of simplicity, we only discuss the pure inter-/intra-level architectures, and the others can be easily extended based on our analysis.

Moreover, we also propose two intra-level hardware architectures and analyze the impact of supporting VBSME in these hardware architectures based on various data flows in this section. The direct extensions of the eight architectures are also proposed. In the following discussion, we assume that the block size is $N \times N$ and that the search range is $[-P_H, P_H)$ and $[-P_V, P_V)$ in the horizontal and vertical directions, respectively.

A. Work of Yang et al.

Yang et al. implemented the first VLSI motion estimator in the world, as shown in Fig. 1(a), which is a 1-D inter-level hardware architecture (1DInterYSW). The number of PEs is equal to the number of searching candidates in the horizontal direction, $2P_H$. Reference pixels are broadcast to all PEs. By selection signals, the corresponding reference pixel is selected and input to each PE. Current pixels are propagated with propagation registers, and the partial SAD is stored in each PE. In each cycle, each PE computes the distortion and accumulates the SAD of a searching candidate. In this architecture, the most important concept is data broadcasting. With the broadcasting technique, the memory bit width, which is defined as the number of bits of reference data required in one cycle, is reduced significantly, although some global routing is required.

Fig. 2. Hardware architectures of (a) 2DIntraVS, (b) 2DIntraKP, and (c) 2DIntraHL, where $N = 4$, $P_H = 2$, and $P_V = 2$.
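The inter-level idea of 1DInterYSW, one PE per horizontal candidate with the partial SAD kept inside the PE, can be sketched behaviorally as follows. This is a simplified model under an assumed schedule, not the exact timing of [5]: current pixels are assumed to enter PE 0 in raster order and to reach PE k after k cycles, and the broadcast plus selection signal is modeled by direct indexing. The name `inter_level_1d` is hypothetical.

```python
def inter_level_1d(cur, ref_rows, n, num_pe):
    """Behavioral sketch of a 1-D inter-level PE array.

    cur      : n x n current block
    ref_rows : n rows of n + num_pe - 1 reference pixels (one row of candidates)
    num_pe   : number of PEs, one per horizontal searching candidate
    Returns (list of SADs, one per PE, and the number of cycles used).
    """
    acc = [0] * num_pe                      # partial SAD stored in each PE
    total_cycles = n * n + num_pe - 1       # last pixel reaches the last PE
    for t in range(total_cycles):
        for k in range(num_pe):
            p = t - k                       # assumed: pixel p arrives at PE k in cycle p + k
            if 0 <= p < n * n:
                i, j = divmod(p, n)
                # Indexing stands in for "broadcast, then select by control signal".
                acc[k] += abs(cur[i][j] - ref_rows[i][j + k])
    return acc, total_cycles
```

Under this assumed schedule, all PEs working on the same block row need the same reference pixel in a given cycle, which is what makes broadcasting possible; PEs that have already moved on to the next row need a pixel from a different row, which is where a selection signal comes in.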

B. Work of Yeo and Hu

Fig. 1(b) shows a 2-D inter-level hardware architecture proposed by Yeo and Hu (2DInterYH). 2DInterYH consists of $N \times N$ PEs and is similar to 1DInterYSW. Reference pixels are broadcast to the PEs, and current pixels are propagated with propagation registers. The partial SADs are stored and accumulated in the respective PEs. Because reference pixels are broadcast in both directions, the number of PEs has to match the MB size. Hence, the search range is partitioned into $(2P_H/N) \times (2P_V/N)$ regions, and each region is computed by a set of $N \times N$ PEs. The characteristic of 2DInterYH is broadcasting in two directions at the same time, which can increase the data reuse.

C. Work of Lai and Chen

Lai and Chen also proposed another 1-D PE array that implemented a 2-D inter-level architecture with two data-interlacing reference arrays (2DInterLC). The hardware architecture is shown in Fig. 1(c) and is similar to 2DInterYH except for two aspects: reference pixels are propagated with propagation registers, and current pixels are broadcast to the PEs. The partial SADs are still stored and accumulated in the PEs. In addition, 2DInterLC has to load reference pixels into the propagation registers before computing SADs. The latency of loading reference pixels can be reduced by partitioning the search range in 2DInterLC. For example, the search range can be partitioned into

parts for a shorter latency.

D. Work of Vos and Stegherr

A 2-D intra-level architecture was proposed by Vos and Stegherr (2DIntraVS), as shown in Fig. 2(a), where the number of PEs is equal to the MB size. Each PE corresponds to a current pixel, and each current pixel is stored in its PE. The important concept of 2DIntraVS is the scanning order of searching candidates, which is known as the snake scan. In order to realize it, a great many propagation registers are used to store reference pixels, and the data in the propagation registers can be shifted in the upward, downward, and right directions. These propagation registers and the long latency for loading reference pixels are the tradeoffs for the reduction of memory usage. The computation flow is as follows. First, the distortion is computed in each PE, and $N$ partial-row SADs are propagated and accumulated in the horizontal direction. Second, an adder tree is used to accumulate the $N$ row SADs into the SAD. The accumulations of the row SADs and the SAD are done in one cycle. Hence, no partial SAD needs to be stored.

Fig. 3. (a) Concept, (b) hardware architecture, and (c) detailed architecture of a PE array with a 1-D adder tree using the Propagate Partial SAD architecture, where $N = 4$.
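The two-stage computation flow described for 2DIntraVS, horizontal accumulation into row SADs followed by an adder tree, can be sketched in software as follows. This is an illustrative model of the arithmetic only; the function names `row_sads` and `adder_tree` are hypothetical, and the pairwise reduction is one possible tree shape, assumed here for illustration.

```python
def row_sads(cur, ref_block):
    """Accumulate distortions along each row, as the partial-row SADs
    are passed from PE to PE in the horizontal direction."""
    n = len(cur)
    rows = []
    for i in range(n):
        partial = 0
        for j in range(n):
            partial += abs(cur[i][j] - ref_block[i][j])
        rows.append(partial)
    return rows


def adder_tree(values):
    """Reduce the values pairwise, level by level, like a binary adder tree."""
    level = list(values)
    while len(level) > 1:
        nxt = [level[a] + level[a + 1] for a in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])       # odd element passes through to the next level
        level = nxt
    return level[0]


def sad_2d_intra(cur, ref_block):
    """SAD of one candidate: row SADs first, then the adder tree."""
    return adder_tree(row_sads(cur, ref_block))
```

Because both stages complete for one candidate before the next candidate starts, nothing has to be carried over between candidates, which matches the statement that no partial SAD is stored.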

E. Work of Komarek and Pirsch

Komarek and Pirsch contributed a detailed systolic mapping procedure based on the dependence graph (DG). By using different DGs, including different scheduling and projections, different systolic hardware architectures can be derived. AB2 (2DIntraKP) is a 2-D intra-level architecture, as shown in Fig. 2(b). Current pixels are stored in the corresponding PEs. Reference pixels are propagated PE by PE in the horizontal direction. The $N$ partial-column SADs are propagated and accumulated in the vertical direction first. After the vertical propagation, these $N$ column SADs are propagated in the horizontal direction. In each PE, the distortion of a current pixel in the current MB is computed and added to the partial-column SAD, which is propagated through the PEs from top to bottom in the vertical direction. In the horizontal propagation, these $N$ column SADs are accumulated one by one by adders and registers.

F. Work of Hsieh and Lin

Hsieh and Lin proposed another 2-D intra-level hardware architecture with a search range buffer (2DIntraHL), as shown in Fig. 2(c). 2DIntraHL consists of $N$ PE arrays in the vertical direction, and each PE array is composed of $N$ PEs in a row. In 2DIntraHL, reference pixels are propagated with propagation registers one by one, which provides the advantages of serial data input and increased data reuse. Current pixels are still stored in the PEs. The $N$ partial-column SADs are propagated in the vertical direction from bottom to top. In each computing cycle, each PE array generates $N$ distortions of a searching candidate and accumulates these distortions with the $N$ partial-column SADs in the vertical propagation. After the accumulation in the vertical direction, the $N$ column SADs are accumulated in the top adder tree in one cycle. The longer latency for loading reference pixels and the large propagation registers are the penalties for the reduction of memory bandwidth and memory bit width.

G. Proposed Propagate Partial SAD

We propose a 2-D intra-level architecture called Propagate Partial SAD [18]. Fig. 3(a) and (b) shows the concept and hardware architecture of Propagate Partial SAD, respectively. The architecture is composed of $N$ PE arrays with a 1-D adder tree in the vertical direction. Current pixels are stored in each PE, and two sets of $N$ continuous reference pixels in a row are broadcast to the $N$ PE arrays at the same time. In each PE array with a 1-D adder tree, $N$ distortions are computed and summed by the 1-D adder tree to generate one row SAD, as shown in Fig. 3(c). The row SADs are accumulated and propagated with propagation registers in the vertical direction, as shown on the right-hand side of Fig. 3(b).

The detailed data flow of Propagate Partial SAD is shown in Fig. 4. The reference data of searching candidates in the even and odd columns are input through Ref. Pixels 0 and Ref. Pixels 1, respectively. After the initial cycles, the SAD of the first searching candidate in the zeroth column is generated, and the SADs of the other searching candidates are sequentially generated in the following cycles. When computing the last $N - 1$ searching candidates in each column, the reference data of searching candidates in the next column begin to be input through the other reference input. Thus, the hardware utilization is 100% except for the initial latency. By broadcasting reference pixel rows and propagating partial-row SADs in the vertical direction, Propagate Partial SAD provides the advantages of fewer reference pixel registers and a shorter critical path.

Fig. 4. Detailed data flow of the proposed Propagate Partial SAD architecture, where $N = 4$ and $P_H = P_V = 2$.

Fig. 5. (a) Concept, (b) hardware architecture, and (c) scan order and memory access of the proposed SAD Tree architecture, where $N = 4$.
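The data flow of Propagate Partial SAD can be approximated by the cycle-level sketch below. It is a simplified model built on assumptions, not the authors' design: candidates are processed column by column, PE array i is assumed to work on candidate t - i in cycle t, and the function name `propagate_partial_sad` and its arguments are hypothetical.

```python
def propagate_partial_sad(cur, strips, n, num_v):
    """Cycle-level sketch of n PE arrays with vertical partial-SAD propagation.

    cur    : n x n current block (row i is held by PE array i)
    strips : strips[c][r] is the n-pixel reference row r for candidate column c;
             each strip has num_v + n - 1 rows
    num_v  : number of searching candidates per column
    Returns (SADs in column-major candidate order, cycle count,
             largest number of columns needing a broadcast row in one cycle).
    """
    total = len(strips) * num_v              # number of searching candidates
    regs = [0] * n                           # partial-SAD propagation registers
    sads, max_inputs = [], 0
    cycles = total + n - 1                   # n - 1 cycles of initial latency
    for t in range(cycles):
        new_regs = [0] * n
        active_cols = set()
        for i in range(n):
            g = t - i                        # assumed schedule: array i handles candidate t - i
            if not 0 <= g < total:
                continue                     # array idle while the pipeline fills or drains
            col, v = divmod(g, num_v)
            active_cols.add(col)
            ref_row = strips[col][v + i]     # same row for every array working on this column
            row_sad = sum(abs(cur[i][j] - ref_row[j]) for j in range(n))
            new_regs[i] = (regs[i - 1] if i > 0 else 0) + row_sad
        regs = new_regs
        max_inputs = max(max_inputs, len(active_cols))
        if 0 <= t - (n - 1) < total:
            sads.append(regs[n - 1])         # one finished SAD per cycle after the latency
    return sads, cycles, max_inputs
```

In this model every PE array working on a given column needs the same reference row in a given cycle, so one broadcast per column suffices. As long as a column holds at least n - 1 candidates, no more than two columns are active at once, which is consistent with the two reference inputs described in the text.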

H. Proposed SAD Tree

Fig. 5(a) shows the concept of the proposed SAD Tree architecture. The proposed SAD Tree is a 2-D intra-level architecture and consists of a 2-D PE array and one 2-D adder tree with propagation registers, as shown in Fig. 5(b) and (c). Current pixels are stored in each PE, and reference pixels are stored in propagation registers for data reuse. In each cycle, $N \times N$ current and reference pixels are input to the PEs. Simultaneously, $N$ continuous reference pixels in a row are input into the propagation registers to update the reference pixels. In the propagation registers, reference pixels are propagated in the vertical direction row by row. In the SAD Tree architecture, all distortions of a searching candidate are generated in the same cycle, and the $N \times N$ distortions are accumulated by an adder tree to derive the SAD in one cycle.

In order to provide high utilization and data reuse, the snake scan is adopted and propagation registers with a reconfigurable data path are developed in the proposed SAD Tree. The scan, shown in Fig. 5(c), consists of five basic steps from A to E. The first step, A, fetches $N$ pixels in a row, and the shift direction of the propagation registers is downward. When calculating the last candidates in a column, one extra reference pixel is required to be input, that is, step B. When the computation of one column is finished, the reference pixels in the propagation registers are shifted left in step C. Because the reference data have already been stored in the propagation registers, the SAD can be directly calculated. The next two steps, D and E, are the same as steps A and B except that the shift direction is upward. After finishing the computation of one column in the search range, we execute step C and then go back to step A. This procedure iterates until all searching candidates in the search range have been calculated. The detailed data flow is shown in Fig. 6. With the snake scan and reconfigurable propagation registers, the data reuse between two successive searching candidates can be maximized, and the hardware utilization approaches 100%.
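The snake scan and the one-candidate-per-cycle behavior of SAD Tree can be illustrated with the short sketch below. It models only the scan order and the arithmetic, not the reconfigurable registers or steps A to E themselves; the names `snake_scan`, `sad_tree`, and `sad_tree_search` are hypothetical, and candidates are written as (column, row) offsets into a reference window, which is an assumption of this sketch.

```python
def snake_scan(num_cols, num_rows):
    """Visit candidates column by column, alternating downward and upward."""
    order = []
    for c in range(num_cols):
        rows = range(num_rows) if c % 2 == 0 else reversed(range(num_rows))
        order.extend((c, v) for v in rows)
    return order


def sad_tree(cur, window):
    """All n*n distortions of one candidate, summed as a 2-D adder tree would."""
    n = len(cur)
    return sum(abs(cur[i][j] - window[i][j]) for i in range(n) for j in range(n))


def sad_tree_search(cur, ref, num_cols, num_rows):
    """Return {(column, row): SAD} with candidates evaluated in snake-scan order."""
    n = len(cur)
    sads = {}
    for c, v in snake_scan(num_cols, num_rows):
        # The slice stands in for the contents of the n x n register array.
        window = [row[c:c + n] for row in ref[v:v + n]]
        sads[(c, v)] = sad_tree(cur, window)
    return sads
```

Successive candidates in this order always differ by one pixel position, so their reference windows share $N(N-1)$ of the $N^2$ pixels; this overlap is the data reuse that the propagation registers exploit.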

I. Impact of Variable Block-Size Motion Estimation in Hardware Architectures

There are many methods to support VBSME in hardware architectures. For example, we can increase the number of PEs or the operating frequency to perform ME for each block size separately. Another method is to reuse the SADs of the smallest blocks, which are the blocks partitioned with the smallest block size, to derive the SADs of larger blocks. With this method, the overhead of supporting VBSME is only a slight increase in gate count, and the other factors, such as frequency, hardware utilization, and memory usage, are the same as those of FBSME. When this method is adopted, the circuit for the SAD calculation is the only difference between FBSME and VBSME hardware designs. Hence, the impact of supporting VBSME in hardware architectures depends on the data flows of partial SADs. In inter-level architectures, the data flow of partial SADs is simple, where the partial SADs are stored in each PE. In intra-level architectures, there are two
