688 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 13, NO. 7, JULY 2003
Rate-Constrained Coder Control and Comparison
of Video Coding Standards
Thomas Wiegand, Heiko Schwarz, Anthony Joch, Faouzi Kossentini, Senior Member, IEEE, and
Gary J. Sullivan, Senior Member, IEEE
Abstract—A unified approach to the coder control of video
coding standards such as MPEG-2, H.263, MPEG-4, and the draft
video coding standard H.264/AVC is presented. The performance
of the various standards is compared by means of PSNR and
subjective testing results. The results indicate that H.264/AVC
compliant encoders typically achieve essentially the same repro-
duction quality as encoders that are compliant with the previous
standards while typically requiring 60% or less of the bit rate.
Index Terms—Coder control, Lagrangian, H.263, H.264/AVC,
MPEG-2, MPEG-4, rate-constrained, standards, video.
I. INTRODUCTION
THE specifications of most video coding standards, including MPEG-2 Visual [1], H.263 [2], MPEG-4 Visual [3], and H.264/AVC [4], provide only the bit-stream syntax and
the decoding process in order to enable interoperability. The
encoding process is left out of the scope to permit flexible
implementations. However, the operational control of the
source encoder is a key problem in video compression. For the
encoding of a video source, many coding parameters such as
macroblock modes, motion vectors, and transform coefficient
levels have to be determined. The chosen values determine the
rate-distortion efficiency of the produced bitstream of a given
encoder.
In this paper, the operational control of MPEG-2, H.263,
MPEG-4, and H.264/AVC encoders is optimized with respect
to their rate-distortion efficiency using Lagrangian optimization
techniques. The optimization is based on [5] and [6], where
the encoder control for the ITU-T Recommendation H.263
[2] is addressed. The Lagrangian coder control as described
in this paper was also integrated into the test models TMN-10
[7] and JM-2 [8] for the ITU-T Recommendation H.263 and
H.264/AVC, respectively. The same Lagrangian coder control
method was also applied to the MPEG-4 verification model
VM-18 [9] and the MPEG-2 test model TM-5 [10]. In addition
to achieving performance gains, the use of similar rate-dis-
Manuscript received December 12, 2001; revised May 10, 2003.
T. Wiegand and H. Schwarz are with the Fraunhofer-Institute for Telecom-
munications, Heinrich-Hertz Institute, 10587 Berlin, Germany (e-mail:
wiegand@hhi.de; hschwarz@hhi.de).
A. Joch is with UB Video Inc., Vancouver, BC V6B 2R9, Canada (e-mail:
anthony@ubvideo.com).
F. Kossentini is with the Department of Electrical and Computer Engi-
neering, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
(e-mail: faouzi@ece.ubc.ca).
G. J. Sullivan is with the Microsoft Corporation, Redmond, WA 98052 USA
(e-mail: garysull@microsoft.com).
Digital Object Identifier 10.1109/TCSVT.2003.815168
tortion optimization methods in all encoders allows a useful
comparison between the encoders in terms of coding efficiency.
This paper is organized as follows. Section II gives an
overview of the syntax features of MPEG-2 Video, H.263,
MPEG-4 Visual, and H.264/AVC. The rate-distortion-optimized coder control is described in Section III, and experimental results are presented in Section IV.
II. STANDARD SYNTAX AND DECODERS
All ITU-T and ISO/IEC JTC1 standards since H.261 [11]
have in common that they are based on the so-called block-based
hybrid video coding approach. The basic source-coding algo-
rithm is a hybrid of inter-picture prediction to utilize temporal
redundancy and transform coding of the prediction error signal
to reduce spatial redundancy. Each picture of a video signal
is partitioned into fixed-size macroblocks of 16×16 samples,
which can be transmitted in one of several coding modes de-
pending on the picture or slice coding type. Common to all
standards is the definition of INTRA coded pictures or I-pic-
tures. In I-pictures, all macroblocks are coded without refer-
ring to other pictures in the video sequence. Also common is
the definition of predictive-coded pictures, so-called P-pictures
and B-pictures, with the latter being extended conceptually in
H.264/AVC coding. In predictive-coded pictures, typically one
of a variety of INTER coding modes can be chosen to encode
each macroblock.
In order to manage the large number of coding tools included
in standards and the broad range of formats and bit rates sup-
ported, the concept of profiles and levels is typically employed
to define a set of conformance points, each targeting a specific
class of applications. These conformance points are designed
to facilitate interoperability between various applications of the
standard that have similar functional requirements. A profile de-
fines a set of coding tools or algorithms that can be used in gen-
erating a compliant bitstream, whereas a level places constraints
on certain key parameters of the bitstream, such as the picture
resolution and bit rate.
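Conceptually, a level check is a table lookup against key bitstream parameters. The following Python sketch illustrates the idea; the level names and limit values are illustrative placeholders, not normative figures from any of the standards discussed here.

```python
# Sketch: how a level places constraints on key bitstream parameters.
# The limits below are illustrative placeholders, not normative values.
LEVEL_LIMITS = {
    # level: (max_width, max_height, max_bit_rate_bps)
    "ML": (720, 576, 15_000_000),    # an SDTV-class level
    "HL": (1920, 1152, 80_000_000),  # an HDTV-class level
}

def conforms(width: int, height: int, bit_rate: int, level: str) -> bool:
    """Check picture resolution and bit rate against a level's limits."""
    max_w, max_h, max_rate = LEVEL_LIMITS[level]
    return width <= max_w and height <= max_h and bit_rate <= max_rate

print(conforms(720, 576, 6_000_000, "ML"))    # True: within the SD-class limits
print(conforms(1920, 1080, 20_000_000, "ML")) # False: exceeds the resolution limit
```

A real conformance point constrains many more parameters (buffer sizes, macroblock rates, tool sets), but the lookup structure is the same.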
Although MPEG-2, H.263, MPEG-4, and H.264/AVC de-
fine similar coding algorithms, they contain features and en-
hancements that make them differ. These differences mainly
involve the formation of the prediction signal, the block sizes
used for transform coding, and the entropy coding methods. In
the following, the description of the various standards is limited
to those features relevant to the comparisons described in this
paper.
1051-8215/03$17.00 © 2003 IEEE

A. ISO/IEC Standard 13818-2/ITU-T Recommendation
H.262: MPEG-2
MPEG-2 forms the heart of broadcast-quality digital tele-
vision for both standard-definition and high-definition televi-
sion (SDTV and HDTV) [1], [12], [13]. MPEG-2 video (IS 13818-2/ITU-T Recommendation H.262) was designed to encompass MPEG-1 [14] and to also provide high quality with
interlaced video sources at bit rates in the range of 4–30 Mbit/s.
Although usually thought of as an ISO standard, MPEG-2 video
was developed as an official joint project of both the ISO/IEC
JTC1 and ITU-T organizations, and was completed in late 1994.
MPEG-2 incorporates various features from H.261 and
MPEG-1. It uses the basic coding structure that is still predominant today. For each macroblock, which consists of one 16×16 luminance block and two 8×8 chrominance blocks for 4:2:0-formatted video sequences, a syntax element indicating the
macroblock coding mode (and signalling a quantizer change)
is transmitted. While all macroblocks of I-pictures are coded
in INTRA mode, macroblocks of P-pictures can be coded in
INTRA, INTER-16×16, or SKIP mode. For the SKIP mode,
runs of consecutive skipped macroblocks are transmitted, and the picture in the skipped region is represented using INTER prediction without adding any residual
difference representation. In B-pictures, the prediction signal
for the motion-compensated INTER-16×16 mode can be
formed by forward, backward, or bidirectionally interpolated
prediction. The motion compensation is generally based on
16×16 blocks and utilizes half-pixel accurate motion vectors,
with bilinear interpolation of half-pixel positions. The motion
vectors are predicted from a single previously encoded motion
vector in the same slice.
Texture coding is conducted using a DCT on blocks of 8×8
samples, and uniform scalar quantization (with the exception
of the central dead-zone) is applied that can be adjusted using
quantization values from 2 to 62. Additionally, a perceptually
weighted matrix based on the frequency of each transform co-
efficient (except the Intra DC coefficient) can be used. The en-
tropy coding is performed using zig-zag scanning and two-di-
mensional run-level variable-length coding (VLC). There are
two available VLC tables for transmitting the transform coef-
ficient levels, of which one must be used for predictive-coded
macroblocks and either can be used for INTRA macroblocks,
as selected by the encoder on the picture level.
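The zig-zag scan and two-dimensional run-level coding described above can be sketched as follows. This is an illustrative Python rendering of the general technique, not the normative MPEG-2 VLC tables, which additionally map each (run, level) pair to a specific codeword.

```python
def zigzag_order(n=8):
    """Zig-zag scan order for an n x n coefficient block: traverse the
    anti-diagonals from DC to highest frequency, alternating direction."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def run_level(scanned):
    """Represent a scanned coefficient list as (run, level) pairs:
    'run' zeros followed by a nonzero 'level'."""
    pairs, run = [], 0
    for coeff in scanned:
        if coeff == 0:
            run += 1
        else:
            pairs.append((run, coeff))
            run = 0
    return pairs

print(zigzag_order()[:6])             # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
print(run_level([5, 0, 0, 3, 0, 1]))  # [(0, 5), (2, 3), (1, 1)]
```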
For the coding of interlaced video sources, MPEG-2 pro-
vides the concept of field pictures and field-coded macroblocks
in frame pictures. The top and bottom field of an interlaced
frame can be coded together as frame picture or as two sepa-
rate field pictures. In addition to the macroblock coding modes
described above, field-picture macroblocks can also be coded in
INTER-16×8 prediction mode, in which two different prediction signals are used, one for the upper and one for the lower half
of a macroblock. For macroblocks in frame pictures, a similar
coding mode is provided that uses different prediction signals
for the top and bottom field lines of a macroblock. Macroblocks
of both field and frame pictures can also be transmitted in dual
prime mode. In this coding mode, the final prediction for each
field is formed by averaging two prediction signals, of which
one is obtained by referencing the field with the same parity
and the other is obtained by referencing the field with the oppo-
site parity as the current field. For coding of the residual data,
MPEG-2 provides the possibility to use an alternative scanning
pattern, which can be selected on picture level, and to choose
between a frame- and field-based DCT coding of the prediction
error signal.
The most widely implemented conformance point in the
MPEG-2 standard is the Main profile at the Main Level
(MP@ML). MPEG-2 MP@ML compliant encoders find
application in DVD-video, digital cable television, terrestrial
broadcast of standard definition television, and direct-broadcast
satellite (DBS) systems. This conformance point supports
coding of CCIR 601 content at bit rates up to 15 Mbit/s and
permits use of B-pictures and interlaced prediction modes. In
this work, an MPEG-2 encoder is included in the comparisons
of video encoders for streaming and entertainment applications.
The MPEG-2 bitstreams generated for our comparisons are
compliant with the popular MP@ML conformance point, with the exception of the HDTV bitstreams, which are compliant with
the MP@HL conformance point.
B. ITU-T Recommendation H.263
The first version of ITU-T Recommendation H.263 [2]
defines a basic source-coding algorithm similar to that of
MPEG-2, utilizing the INTER-16×16, INTRA, and SKIP
coding modes. But H.263 Baseline contains significant changes
that make it more efficient at lower bit rates including median
motion vector prediction and three-dimensional run-level-last
VLC with tables optimized for lower bit rates.
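Median motion vector prediction takes the componentwise median of three previously coded neighboring motion vectors (typically left, above, and above-right), and only the difference from that predictor is coded. A minimal sketch, with hypothetical neighbor values:

```python
def median_mv(left, above, above_right):
    """H.263-style median prediction: componentwise median of three
    neighboring motion vectors, each given as an (mvx, mvy) pair."""
    mvx = sorted([left[0], above[0], above_right[0]])[1]
    mvy = sorted([left[1], above[1], above_right[1]])[1]
    return (mvx, mvy)

# The motion vector is then coded as a difference from the predictor.
mv = (3, -1)                                # vector chosen by motion search
pred = median_mv((2, 0), (4, -2), (1, 1))   # -> (2, 0)
mvd = (mv[0] - pred[0], mv[1] - pred[1])    # -> (1, -1), the value transmitted
```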
Moreover, version 1 of H.263 contains eight Annexes (An-
nexes A–G) including four Annexes permitting source coding
options (Annexes D, E, F, and G) for improved compression performance. Annexes D and F are in frequent use today. Annex D
specifies the option for motion vectors to point outside the ref-
erence picture and to have longer motion vectors than H.263
Baseline. Annex F specifies the use of overlapped block mo-
tion compensation and four motion vectors per macroblock with
each motion vector assigned to an 8×8 subblock, i.e., the use of variable block sizes. Hence, an INTER-8×8 coding mode is
added to the set of possible macroblock modes.
H.263+ is the second version of H.263 [2], [15], where sev-
eral optional features are added to H.263 as Annexes I through
T. Annex J of H.263+ specifies a deblocking filter that is ap-
plied inside the motion prediction loop and is used together with
the variable block-size feature of Annex F. H.263+ also adds
some improvements in compression efficiency for the INTRA
macroblock mode through prediction of intra-DCT transform
coefficients from neighboring blocks and specialized quantiza-
tion and VLC coding methods for intra coefficients. This ad-
vanced syntax is described in Annex I of the ITU-T Recom-
mendation H.263+. Annex I provides significant rate-distortion
improvements between 1 and 2 dB compared to the H.263 Baseline INTRA macroblock coding mode when utilizing the same amount of bits for both codecs [15]. Annex T of H.263+ removes
some limitations of the Baseline syntax in terms of quantization
and also improves chrominance fidelity by specifying a smaller

step size for chrominance coefficients than for luminance. The
remaining Annexes contain additional functionalities including
specifications for custom and flexible video formats, scalability,
and backward-compatible supplemental enhancement informa-
tion.
A second set of extensions that adds three more optional
modes to H.263 [2] was completed and approved late in the
year 2000. This version is often referred to as H.263++. The
data partitioned slice mode (Annex V) can provide enhanced
resilience to bit-stream corruption, which typically occurs
during transmission over wireless channels, by separating
header and motion vector information from transform coef-
ficients. Annex W specifies additional backward-compatible
supplemental enhancement information including interlaced
field indications, repeated picture headers, and the indication
of the use of a specific fixed-point inverse DCT. Compression
efficiency and robustness to packet loss can be improved by
using the enhanced reference picture selection mode (Annex
U), which enables long-term memory motion compensation
[22], [23]. In this mode, the spatial displacement vectors that
indicate motion-compensated prediction blocks are extended
by variable time delay, permitting the predictions to originate
from reference pictures other than the most recently decoded
reference picture. Motion-compensation performance is im-
proved because of the larger number of possible predictions
that are available by including more reference frames in the
motion search. In Annex U, two modes are available for the
buffering of reference pictures. The sliding-window mode—in
which only the most recent reference pictures are stored—is
the simplest and most commonly implemented mode. In the
more flexible adaptive buffering mode, buffer management
commands can be inserted into the bitstream as side informa-
tion, permitting an encoder to specify how long each reference
picture remains available for prediction, with a constraint on
the total size of the picture buffer. The maximum number of
reference pictures is typically 5 or 10 when conforming to one
of H.263’s normative profiles, which are discussed next.
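The sliding-window buffering mode of Annex U can be sketched with a bounded queue: storing a newly decoded picture automatically evicts the oldest once the buffer holds the maximum number of reference pictures. The picture identifiers below are illustrative.

```python
from collections import deque

class SlidingWindowBuffer:
    """Sketch of sliding-window reference buffering: only the most
    recent max_refs decoded pictures remain available for prediction."""
    def __init__(self, max_refs=5):
        self.pictures = deque(maxlen=max_refs)

    def store(self, picture_id):
        # Appending beyond maxlen silently evicts the oldest picture.
        self.pictures.append(picture_id)

    def available_references(self):
        return list(self.pictures)

buf = SlidingWindowBuffer(max_refs=5)
for n in range(8):          # decode pictures 0..7
    buf.store(n)
print(buf.available_references())  # [3, 4, 5, 6, 7]
```

The adaptive buffering mode would replace the implicit eviction with explicit buffer-management commands carried as side information.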
The ITU-T has recently approved Annex X of H.263, which
provides a normative definition of profiles, or preferred combi-
nations of optional modes, and levels, which specify maximum
values for several key parameters of an H.263 bitstream. Similar
to their use in MPEG-2, each profile is designed to target a spe-
cific key application, or group of applications that require sim-
ilar functionality. In this work, the rate-distortion capabilities
of the Baseline profile and the Conversational High Compres-
sion (CHC) profile are compared to other standards for use in
videoconferencing applications. The Baseline profile supports
only Baseline H.263 syntax (i.e., no optional modes) and exists
to provide a profile designation to the minimal capability that
all compliant decoders must support. The CHC profile includes
most of the optional modes that provide enhanced coding effi-
ciency without the added delay that is introduced by B-pictures
and without any optional error resilience features. Hence, it is
the best profile to demonstrate the optimal rate-distortion capa-
bilities of the H.263 standard for use in interactive video appli-
cations. Additionally, the High-Latency (HL) profile of H.263,
which adds support for B-pictures to the coding efficiency tools
of the CHC profile, is included in the comparison of encoders
for streaming applications, in which the added delay introduced
by B-pictures is acceptable.
C. ISO/IEC Standard 14496-2: MPEG-4
MPEG-4 Visual [3] standardizes efficient coding methods for
many types of audiovisual data, including natural video con-
tent. For this purpose, MPEG-4 Visual uses the Baseline H.263
algorithm as a starting point so that all compliant MPEG-4 de-
coders must be able to decode any valid Baseline H.263 bit-
stream. However, MPEG-4 includes several additional features
that can improve coding efficiency.
While spatial coding in MPEG-4 uses the 8×8 DCT and
scalar quantization, MPEG-4 supports two different scalar
quantization methods that are referred to as MPEG-style and
H.263-style. In the MPEG-style quantization, perceptually weighted matrices, similar to those used in MPEG-2, assign a specific quantizer to each coefficient in a block, whereas in the H.263 method, the same quantizer is used for all AC coefficients. Quantization of DC coefficients uses a special
nonlinear scale that is a function of the quantization parameter.
Quantized coefficients are scanned in a zig-zag pattern and
assigned run-length codes, as in H.263. MPEG-4 also includes
alternate scan patterns for horizontally and vertically predicted
INTRA blocks and the use of a separate VLC table for INTRA
coefficients. These techniques are similar to those defined in
Annex I of H.263.
Motion compensation in MPEG-4 is based on 16×16 blocks
and supports variable block sizes, as in Annex F of H.263,
so that one motion vector can be specified for each of the
8×8 subblocks of a macroblock, permitting the use of the INTER-8×8 mode. Version 1 of MPEG-4 supports only motion
compensation at half-pixel accuracy, with bilinear interpolation
used to generate values at half-pixel positions. Version 2 of
MPEG-4 additionally supports the use of quarter-pixel accurate
motion compensation, with a windowed 8-tap sinc function used
to generate half-pixel positions and bilinear interpolation for
quarter-pixel positions. Motion vectors are permitted to point
outside the reference picture and are encoded differentially
after median prediction, according to H.263. MPEG-4 does
not include a normative de-blocking filter inside the motion
compensation loop, as in Annex J of H.263, but post filters
may be applied to the reconstructed output at the decoder
to improve visual quality.
The MPEG-4 Simple profile includes all features mentioned
above, with the exception of the MPEG-style quantization
method and quarter-pixel motion compensation. The Advanced
Simple profile adds these two features, plus B-pictures, global
motion compensation (GMC) and special tools for efficient
coding of interlaced video. A video coder compliant with the
Simple profile and the Advanced Simple profile will be used
in our experiments.
D. ITU-T Recommendation H.264/ISO/IEC Standard
14496-10 AVC: H.264/AVC
H.264/AVC [4] is the latest joint project of the ITU-T VCEG
and ISO/IEC MPEG. The H.264/AVC design covers a video
coding layer (VCL) and a Network Adaptation Layer (NAL).

Although the VCL design basically follows the design of prior
video coding standards such as MPEG-2, H.263, and MPEG-4,
it contains new features that enable it to achieve a significant improvement in compression efficiency in relation to prior coding
standards. For details, please refer to [16]. Here, we will give a
very brief description of the necessary parts of H.264/AVC in
order to make the paper more self-contained.
In H.264/AVC, blocks of 4×4 samples are used for transform coding, and thus a macroblock consists of 16 luminance and 8 chrominance blocks. Similar to the I-, P-, and B-pictures defined
for MPEG-2, H.263, and MPEG-4, the H.264/AVC syntax sup-
ports I-, P-, and B-slices. A macroblock can always be coded in
one of several INTRA coding modes. There are two classes of
INTRA coding modes, which are denoted as INTRA-16×16 and INTRA-4×4 in the following. In contrast to previous standards, where only some of the DCT coefficients can be predicted
from neighboring INTRA-blocks, in H.264/AVC, prediction is
always utilized in the spatial domain by referring to neighboring
samples of already coded blocks. When using the INTRA-4×4 mode, each 4×4 block of the luminance component utilizes one of nine prediction modes. The chosen modes are transmitted as side information. With the INTRA-16×16 mode, a
uniform prediction is performed for the whole luminance com-
ponent of a macroblock. Four prediction modes are supported in
the INTRA-16×16 mode. For both classes of INTRA coding
modes, the chrominance components are predicted using one of
four possible prediction modes.
In addition to the INTRA modes, H.264/AVC provides var-
ious other motion-compensated coding modes for macroblocks
in P-slices. Each motion-compensated mode corresponds to
a specific partition of the macroblock into fixed size blocks
used for motion description. Macroblock partitions with block
sizes of 16×16, 16×8, 8×16, and 8×8 luminance samples are supported by the syntax, corresponding to the INTER-16×16, INTER-16×8, INTER-8×16, and INTER-8×8 macroblock modes, respectively. In case the INTER-8×8 macroblock mode is chosen, each of the 8×8 submacroblocks can be further partitioned into blocks of 8×8, 8×4, 4×8, or 4×4 luminance samples. H.264/AVC generally supports multi-frame
motion-compensated prediction. That is, similar to Annex
U of H.263, more than one prior coded picture can be used
as reference for the motion compensation. In H.264/AVC,
motion compensation is performed with quarter-pixel accurate
motion vectors. Prediction values at half-pixel locations are
obtained by applying a one-dimensional six-tap finite impulse
response (FIR) filter in each direction requiring a half-sample
offset (horizontal or vertical or both, depending on the value
of the motion vector), and prediction values at quarter-pixel
locations are generated by averaging samples at the integer-
and half-pixel positions. The motion vector components are
differentially coded using either median or directional predic-
tion from neighboring blocks.
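The luminance interpolation described above can be sketched in one dimension: the half-sample filter uses the taps (1, −5, 20, 20, −5, 1) with rounding and a right shift by 5 (division by 32), and quarter-sample values are rounded averages of neighboring integer- and half-sample values. This simplified sketch omits the ordering and intermediate-precision details of the full two-dimensional normative process.

```python
def half_pel(samples, i):
    """Half-sample value between samples[i] and samples[i+1], using the
    six-tap FIR filter (1, -5, 20, 20, -5, 1) / 32, rounded and clipped."""
    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(t * samples[i - 2 + k] for k, t in enumerate(taps))
    return min(255, max(0, (acc + 16) >> 5))

def quarter_pel(a, b):
    """Quarter-sample value: rounded average of two neighboring
    integer- or half-sample values."""
    return (a + b + 1) >> 1

row = [10, 12, 40, 44, 41, 39, 12, 10]
h = half_pel(row, 3)        # half-pel between row[3]=44 and row[4]=41 -> 42
q = quarter_pel(row[3], h)  # quarter-pel between 44 and 42 -> 43
```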
In comparison to MPEG-2, H.263, and MPEG-4, the concept
of B-slices is generalized in H.264/AVC. For details please refer
to [17]. B-slices utilize two distinct reference picture lists, and
four different types of INTER prediction are supported: list 0,
list 1, bi-predictive, and direct prediction. While list 0 predic-
tion indicates that the prediction signal is formed by motion
compensation from a picture of the first reference picture list,
a picture of the second reference picture list is used for building
the prediction signal if list 1 prediction is used. In the bi-pre-
dictive mode, the prediction signal is formed by a weighted
average of a motion-compensated list 0 and list 1 prediction
signal. The direct prediction mode differs from the one used in
H.263 andMPEG-4 in that no delta motion vectoris transmitted.
Furthermore, there are two methods for obtaining the predic-
tion signal, referred to as temporal and spatial direct prediction,
which can be selected by an encoder on the slice level. B-slices
utilize a similar macroblock partitioning to P-slices. Besides the
INTER-16×16, INTER-16×8, INTER-8×16, INTER-8×8,
and the INTRA modes, a macroblock mode that utilizes direct
prediction, the DIRECT mode, is provided. Additionally, for
each 16×16, 16×8, 8×16, and 8×8 partition, the prediction
method (list 0, list 1, bi-predictive) can be chosen separately.
An 8×8 partition of a B-slice macroblock can also be coded in DIRECT-8×8 mode. If no prediction error signal is transmitted
for a DIRECT macroblock mode, it is also referred to as B-slice
SKIP mode.
H.264/AVC is basically similar to prior coding standards in
that it utilizes transform coding of the prediction error signal.
However, in H.264/AVC the transformation is applied to 4×4 blocks and, instead of the DCT, H.264/AVC uses a separable integer transform with basically the same properties as a 4×4
DCT. Since the inverse transform is defined by exact integer
operations, inverse-transform mismatches are avoided. An additional 2×2 transform is applied to the four DC coefficients of each chrominance component. If the INTRA-16×16 mode is in use, a similar operation extending the length of the transform basis functions is performed on the 4×4 DC coefficients of the luminance signal.
For the quantization of transform coefficients, H.264/AVC uses scalar quantization, but without an extra-wide dead-zone around zero as found in H.263 and MPEG-4. One of 52 quantizers is selected for each macroblock by the quantization parameter QP. The quantizers are arranged such that the quantization step size increases by approximately 12.5% when incrementing QP by one. The transform coefficient levels are scanned in a zig-zag fashion if the block is part of a macroblock coded in frame mode; for field-mode macroblocks, an alternative scanning pattern is used. The 2×2 DC coefficients of the chrominance components are scanned in raster-scan order.
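The roughly 12.5% growth per quantization-parameter increment compounds to a doubling of the step size every six values. A small numeric sketch (the base step size of 0.625 is the commonly cited value for H.264/AVC; the exact normative step sizes are tabulated in the standard):

```python
def q_step(qp):
    """Approximate H.264/AVC quantization step size for a given QP,
    assuming the commonly cited base step size of 0.625 at QP = 0."""
    return 0.625 * 2 ** (qp / 6)

print(q_step(6) / q_step(0))  # 2.0: the step size doubles every 6 QP values
print(q_step(1) / q_step(0))  # ~1.122: roughly the 12.5% increase per QP step
```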
All syntax elements of a macroblock including the vectors of
scanned transform coefficient levels are transmitted by entropy
coding methods.
Two methods of entropy coding are supported by
H.264/AVC. The default entropy coding method uses a
single infinite-extent codeword set for all syntax elements
except the residual data. The vectors of scanned transform
coefficient levels are transmitted using a more sophisticated
method called context-adaptive VLC (CAVLC). This scheme
basically uses the concept of run-length coding as it is found
in MPEG-2, H.263, and MPEG-4; however, VLC tables for
various syntax elements are switched depending on the values
of previously transmitted syntax elements. Since the VLC
tables are well designed to match the corresponding conditional
statistics, the entropy coding performance is improved in com-

parison to schemes using a single VLC table. The efficiency of
entropy coding can be improved further if the context-adaptive
binary arithmetic coding (CABAC) is used. On the one hand,
the usage of arithmetic coding allows the assignment of a
noninteger number of bits to each symbol of an alphabet,
which is extremely beneficial for symbol probabilities much
greater than 0.5. On the other hand, the usage of adaptive codes
permits adaptation to nonstationary symbol statistics. Another
important property of CABAC is its context modeling. The
statistics of already coded syntax elements are used to estimate
conditional probabilities of coding symbols. Inter-symbol
redundancies are exploited by switching several estimated
probability models according to already coded symbols in
the neighborhood of the symbol to encode. For details about
CABAC, please refer to [18].
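The benefit of assigning a noninteger number of bits can be made concrete: the ideal code length of a symbol with probability p is −log2(p) bits, which drops well below one bit for very likely symbols, whereas any VLC must spend at least one whole bit per symbol. A short numeric sketch:

```python
import math

# Ideal code length -log2(p) versus the one-bit-per-symbol floor of a VLC.
for p in (0.5, 0.9, 0.99):
    print(f"p = {p}: ideal {-math.log2(p):.3f} bits/symbol, VLC >= 1 bit")
```

An arithmetic coder approaches the ideal code length, which is why CABAC gains most on highly skewed symbol statistics.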
For removing block-edge artifacts, the H.264/AVC design in-
cludes a de-blocking filter, which is applied inside the motion
prediction loop. The strength of filtering is adaptively controlled
by the values of several syntax elements.
Similar to MPEG-2, a frame of interlaced video can be coded
as a single frame picture or two separate field pictures. Addi-
tionally, H.264/AVC supports a macroblock-adaptive switching
between frame and field coding. For this purpose, a pair of vertically adjacent macroblocks is considered as a coding unit, which can
be either transmitted as two frame macroblocks or a top and a
bottom field macroblock.
In H.264/AVC, three profiles are defined. The Baseline pro-
file includes all described features except B-slices, CABAC,
and the interlaced coding tools. Since the main target appli-
cation area of the Baseline profile is the interactive transmis-
sion of video, it is used in the comparison of video encoders
for videoconferencing applications. In the comparison for video
streaming and entertainment applications, which allow a larger
delay, the Main profile of H.264/AVC is used. The Main profile
adds support for B-slices, the highly efficient CABAC entropy
coding method, as well as the interlaced coding tools.
III. VIDEO CODER CONTROL
One key problem in video compression is the operational
control of the source encoder. This problem is compounded
because typical video sequences contain widely varying con-
tent and motion, necessitating the selection between different
coding options with varying rate-distortion efficiency for dif-
ferent parts of the image. The task of coder control is to deter-
mine a set of coding parameters, and thereby the bitstream, such
that a certain rate-distortion trade-off is achieved for a given de-
coder. This article focuses on coder control algorithms for the
case of error-free transmission of the bitstream. For a discus-
sion of the application of coder control algorithms in the case of
error-prone transmission, please refer to [19]. A particular em-
phasis is on Lagrangian bit-allocation techniques, which have
emerged to form the most widely accepted approach in recent
standard development. The popularity of this approach is due
to its effectiveness and simplicity. For completeness, we briefly
review the Lagrangian optimization techniques and their appli-
cation to video coding.
A. Optimization Using Lagrangian Techniques
Consider K source samples that are collected in the K-tuple S = (S_1, ..., S_K). A source sample S_k can be a scalar or vector. Each source sample S_k can be quantized using several possible coding options that are indicated by an index out of the set O_k. Let I_k be the selected index to code S_k. Then the coding options assigned to the elements in S are given by the components in the K-tuple I = (I_1, ..., I_K). The problem of finding the combination of coding options that minimizes the distortion for the given sequence of source samples subject to a given rate constraint R_c can be formulated as

   min_I D(S, I)   subject to   R(S, I) ≤ R_c.   (1)

Here, D(S, I) and R(S, I) represent the total distortion and rate, respectively, resulting from the quantization of S with a particular combination of coding options I. In practice, rather than solving the constrained problem in (1), an unconstrained formulation is employed, that is

   I* = argmin_I [ D(S, I) + λ · R(S, I) ]   (2)

with λ ≥ 0 being the Lagrange parameter. This unconstrained solution to a discrete optimization problem was introduced by Everett [20]. The solution I* to (2) is optimal in the sense that if a rate constraint R_c corresponds to λ, then the total distortion D(S, I*) is minimum for all combinations of coding options with bit rate less than or equal to R_c.

We can assume additive distortion and rate measures, and let these two quantities depend only on the choice of the parameter corresponding to each sample. Then, a simplified Lagrangian cost function can be computed using

   J(S_k, I_k | λ) = D(S_k, I_k) + λ · R(S_k, I_k).   (3)

In this case, the optimization problem in (2) reduces to

   min_I Σ_k J(S_k, I_k | λ) = Σ_k min_{I_k} J(S_k, I_k | λ)   (4)

and can be easily solved by independently selecting the coding option for each sample S_k. For this particular scenario, the problem formulation is equivalent to the bit-allocation problem for an arbitrary set of quantizers proposed by Shoham and Gersho [21].
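For additive measures, the independent per-sample minimization amounts to evaluating J = D + λ·R for every coding option of a sample and keeping the cheapest. A minimal sketch with illustrative (not measured) distortion/rate values for three hypothetical macroblock modes:

```python
def select_option(options, lam):
    """Pick the coding option minimizing the Lagrangian cost J = D + lam * R."""
    return min(options, key=lambda o: o["D"] + lam * o["R"])

# Illustrative distortion/rate values for three hypothetical modes.
options = [
    {"name": "SKIP",  "D": 90.0, "R": 1},
    {"name": "INTER", "D": 25.0, "R": 30},
    {"name": "INTRA", "D": 10.0, "R": 90},
]

print(select_option(options, lam=0.1)["name"])  # small lambda favors low distortion: INTRA
print(select_option(options, lam=5.0)["name"])  # large lambda favors low rate: SKIP
```

Sweeping λ from small to large traces out the operational rate-distortion curve of the available options.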
B. Lagrangian Optimization in Hybrid Video Coding
The application of Lagrangian techniques to control a hybrid video coder is not straightforward because of temporal and spatial dependencies of the rate-distortion costs. Consider a block-based hybrid video codec such as H.261, H.263, H.264/AVC, or MPEG-1/2/4. Let the image sequence be partitioned into K distinct blocks S_k, and let the associated pixels be given as A_k. The options O_k to encode each block are categorized into INTRA and INTER, i.e., predictive coding modes with associated parameters. The parameters are transform coefficients and the quantizer value Q for both modes, plus one or more motion vectors for the INTER mode. The parameters for both modes