
Overview of the H.264/AVC video coding standard

TL;DR: An overview of the technical features of H.264/AVC is provided, profiles and applications for the standard are described, and the history of the standardization process is outlined.
Abstract: H.264/AVC is the newest video coding standard of the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. The main goals of the H.264/AVC standardization effort have been enhanced compression performance and provision of a "network-friendly" video representation addressing "conversational" (video telephony) and "nonconversational" (storage, broadcast, or streaming) applications. H.264/AVC has achieved a significant improvement in rate-distortion efficiency relative to existing standards. This article provides an overview of the technical features of H.264/AVC, describes profiles and applications for the standard, and outlines the history of the standardization process.

Summary (5 min read)

Introduction

  • The new standard is designed for technical solutions including at least the following application areas: broadcast over cable, satellite, cable modem, DSL, terrestrial, etc.; interactive or serial storage on optical and magnetic devices, DVD, etc.; and conversational services over ISDN, Ethernet, LAN, DSL, wireless and mobile networks, modems, etc., or mixtures of these.
  • Multiple reference picture motion compensation: Predictively coded pictures (called “P” pictures) in MPEG-2 and its predecessors used only one previous picture to predict the values in an incoming picture.
  • In addition to improved prediction methods, other parts of the design were also enhanced for improved coding efficiency, including the following.
  • Robustness to data errors/losses and flexibility for operation over a variety of network environments is enabled by a number of design aspects new to the H.264/AVC standard, including the following highlighted features.

A. NAL Units

  • The coded video data is organized into NAL units, each of which is effectively a packet that contains an integer number of bytes.
  • The first byte of each NAL unit is a header byte that contains an indication of the type of data in the NAL unit, and the remaining bytes contain payload data of the type indicated by the header.
  • The payload data in the NAL unit is interleaved as necessary with emulation prevention bytes, which are bytes inserted with a specific value to prevent a particular pattern of data called a start code prefix from being accidentally generated inside the payload.
  • The NAL unit structure definition specifies a generic format for use in both packet-oriented and bitstream-oriented transport systems, and a series of NAL units generated by an encoder is referred to as a NAL unit stream.
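The one-byte NAL unit header described above can be unpacked with simple bit operations. The sketch below is illustrative (the function name is our own), assuming the standard's layout: a 1-bit forbidden_zero_bit, a 2-bit nal_ref_idc, and a 5-bit nal_unit_type.

```python
def parse_nal_header(nal_unit: bytes) -> dict:
    """Unpack the first byte of a NAL unit into its three fields."""
    header = nal_unit[0]
    return {
        "forbidden_zero_bit": header >> 7,     # must be 0 in a conforming stream
        "nal_ref_idc": (header >> 5) & 0x03,   # nonzero: content used for reference
        "nal_unit_type": header & 0x1F,        # identifies the payload type
    }

# 0x67 = 0b0_11_00111: a reference NAL unit with nal_unit_type 7
# (a sequence parameter set).
info = parse_nal_header(b"\x67\x42\x00\x1f")
```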

B. NAL Units in Byte-Stream Format Use

  • Some systems (e.g., H.320 and MPEG-2/H.222.0 systems) require delivery of the entire or partial NAL unit stream as an ordered stream of bytes or bits within which the locations of NAL unit boundaries need to be identifiable from patterns within the coded data itself.
  • For use in such systems, the H.264/AVC specification defines a byte stream format.
  • In the byte stream format, each NAL unit is prefixed by a specific pattern of three bytes called a start code prefix.
  • The use of emulation prevention bytes guarantees that start code prefixes are unique identifiers of the start of a new NAL unit.
  • A small amount of additional data (one byte per video picture) is also added to allow decoders that operate in systems that provide streams of bits without alignment to byte boundaries to recover the necessary alignment from the data in the stream.
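The emulation-prevention mechanism can be sketched as follows: whenever two zero bytes have been emitted and the next payload byte is in the range 0x00 to 0x03, the encoder inserts a 0x03 byte, and the decoder reverses the process. A minimal illustration (function names are ours):

```python
def insert_emulation_prevention(rbsp: bytes) -> bytes:
    """Insert a 0x03 byte after every 0x00 0x00 pair that would otherwise
    be followed by 0x00..0x03, so a start code prefix (0x000001) can never
    appear inside the payload."""
    out, zeros = bytearray(), 0
    for b in rbsp:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)   # emulation prevention byte breaks the zero run
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(out)

def remove_emulation_prevention(ebsp: bytes) -> bytes:
    """Decoder side: drop every 0x03 that follows a 0x00 0x00 pair."""
    out, zeros = bytearray(), 0
    for b in ebsp:
        if zeros >= 2 and b == 0x03:
            zeros = 0
            continue           # discard the emulation prevention byte
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(out)
```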

C. NAL Units in Packet-Transport System Use

  • In other systems (e.g., internet protocol/RTP systems), the coded data is carried in packets that are framed by the system transport protocol, and identification of the boundaries of NAL units within the packets can be established without use of start code prefix patterns.
  • In such systems, the inclusion of start code prefixes in the data would be a waste of data carrying capacity, so instead the NAL units can be carried in data packets without start code prefixes.
D. VCL and Non-VCL NAL Units

  • NAL units are classified into VCL and non-VCL NAL units. The VCL NAL units contain the data that represents the values of the samples in the video pictures, while the non-VCL NAL units contain associated additional information such as parameter sets and supplemental enhancement information.

E. Parameter Sets

  • There are two types of parameter sets: sequence parameter sets, which apply to a series of consecutive coded video pictures called a coded video sequence, and picture parameter sets, which apply to the decoding of one or more individual pictures within a coded video sequence.
  • The sequence and picture parameter-set mechanism decouples the transmission of infrequently changing information from the transmission of coded representations of the values of the samples in the video pictures.
  • Each VCL NAL unit contains an identifier that refers to the content of the relevant picture parameter set and each picture parameter set contains an identifier that refers to the content of the relevant sequence parameter set.
  • In some applications, parameter sets may be sent within the channel that carries the VCL NAL units (termed “in-band” transmission).
  • In other applications (see Fig. 3), it can be advantageous to convey the parameter sets “out-of-band” using a more reliable transport mechanism than the video channel itself.
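The two-level referencing chain above can be pictured as simple table lookups; the tables and field names below are illustrative stand-ins, not the standard's actual syntax element names.

```python
# Hypothetical decoded parameter set tables, keyed by their identifiers.
sps_table = {0: {"profile": "Main", "level": 4.0}}
pps_table = {0: {"sps_id": 0, "entropy_coding": "CABAC"}}

def resolve_parameter_sets(slice_pps_id: int):
    """Follow the chain: VCL NAL unit -> picture parameter set -> sequence
    parameter set. The slice header names a PPS, which in turn names an SPS."""
    pps = pps_table[slice_pps_id]   # referenced by the slice header
    sps = sps_table[pps["sps_id"]]  # referenced by the PPS
    return sps, pps
```

Because the slice data carries only these small identifiers, the infrequently changing header information can be repeated or sent out-of-band without touching the coded picture data.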

F. Access Units

  • The decoding of each access unit results in one decoded picture.
  • Each access unit contains a set of VCL NAL units that together compose a primary coded picture.
  • It may also be prefixed with an access unit delimiter to aid in locating the start of the access unit.
  • Following the primary coded picture may be some additional VCL NAL units that contain redundant representations of areas of the same video picture.
  • These are referred to as redundant coded pictures, and are available for use by a decoder in recovering from loss or corruption of the data in the primary coded pictures.

G. Coded Video Sequences

  • A coded video sequence consists of a series of access units that are sequential in the NAL unit stream and use only one sequence parameter set.
  • Each coded video sequence can be decoded independently of any other coded video sequence, given the necessary parameter set information, which may be conveyed “in-band” or “out-of-band”.
  • At the beginning of a coded video sequence is an instantaneous decoding refresh (IDR) access unit.
  • As in all prior ITU-T and ISO/IEC JTC1 video standards since H.261 [3], the VCL design follows the so-called block-based hybrid video coding approach (as depicted in Fig. 8), in which each coded picture is represented in block-shaped units of associated luma and chroma samples called macroblocks.
  • The basic source-coding algorithm is a hybrid of inter-picture prediction to exploit temporal statistical dependencies and transform coding of the prediction residual to exploit spatial statistical dependencies.

A. Pictures, Frames, and Fields

  • A coded picture in [1] can represent either an entire frame or a single field, as was also the case for MPEG-2 video.
  • The top field contains the even-numbered rows 0, 2, …, H−2, with H being the number of rows of the frame.
  • The bottom field contains the odd-numbered rows (starting with the second line of the frame).
  • H.264/AVC is primarily agnostic with respect to this video characteristic, i.e., the underlying interlaced or progressive timing of the original captured pictures.
  • Instead, its coding specifies a representation based primarily on geometric concepts rather than being based on timing.
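The frame/field relationship above amounts to taking alternating rows of the frame; a minimal sketch, treating a frame as a list of rows:

```python
def split_fields(frame):
    """Split an interlaced frame into its two fields: the top field holds
    the even-numbered rows (0, 2, ...) and the bottom field holds the
    odd-numbered rows (1, 3, ...)."""
    return frame[0::2], frame[1::2]

top, bottom = split_fields(["row0", "row1", "row2", "row3"])
```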

B. YCbCr Color Space and 4:2:0 Sampling

  • The human visual system seems to perceive scene content in terms of brightness and color information separately, and with greater sensitivity to the details of brightness than color.
  • Video transmission systems can be designed to take advantage of this.
  • The video color space used by H.264/AVC separates a color representation into three components called Y, Cb, and Cr. Component Y is called luma and represents brightness, while Cb and Cr are chroma components representing color difference.
  • Because the human visual system is less sensitive to chroma detail, each chroma component is sampled at half the luma resolution both horizontally and vertically; this is called 4:2:0 sampling, with 8 bits of precision per sample.
  • (Proposals for extension of the standard to also support higher-resolution chroma and a larger number of bits per sample are currently being considered.)
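A quick way to see the savings of 4:2:0 sampling is to count samples per picture; the helper below is an illustrative sketch.

```python
def sample_counts_420(width: int, height: int):
    """Sample counts for a 4:2:0 picture: one full-resolution luma (Y)
    plane plus two chroma planes (Cb, Cr), each subsampled by a factor of
    two horizontally and vertically, at 8 bits (one byte) per sample."""
    luma = width * height
    chroma = (width // 2) * (height // 2)   # per chroma component
    return luma, chroma, luma + 2 * chroma  # total bytes: 1.5 per pixel

# A 16x16 macroblock carries 256 luma samples and 2 x 64 chroma samples.
counts = sample_counts_420(16, 16)
```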

C. Division of the Picture Into Macroblocks

  • A picture is partitioned into fixed-size macroblocks that each cover a rectangular picture area of 16×16 samples of the luma component and 8×8 samples of each of the two chroma components.
  • This partitioning into macroblocks has been used in all previous ITU-T and ISO/IEC JTC1 video coding standards since H.261 [3].
  • Macroblocks are the basic building blocks of the standard for which the decoding process is specified.
  • The basic coding algorithm for a macroblock is described after the authors explain how macroblocks are grouped into slices.

D. Slices and Slice Groups

  • Slices are sequences of macroblocks, which are processed in raster-scan order when FMO (described below) is not in use.
  • The macroblock to slice group map consists of a slice group identification number for each macroblock in the picture, specifying which slice group the associated macroblock belongs to.
  • Using FMO, a picture can be split into many macroblock scanning patterns such as interleaved slices, a dispersed macroblock allocation, one or more “foreground” slice groups and a “leftover” slice group, or a checker-board type of mapping.
  • The latter two are illustrated in Fig. 7.
  • In addition to the coding types available in a P slice, some macroblocks of the B slice can also be coded using inter prediction with two motion-compensated prediction signals per prediction block.
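As an illustration of a macroblock-to-slice-group map, the checker-board mapping mentioned above can be generated as follows. This is a sketch of just one of the several patterns FMO permits; the function name is ours.

```python
def checkerboard_map(mb_width: int, mb_height: int):
    """Assign each macroblock (listed in raster-scan order) to slice
    group 0 or 1 in a checker-board pattern."""
    return [(x + y) % 2 for y in range(mb_height) for x in range(mb_width)]

# A QCIF picture (176x144 samples) is an 11x9 grid of 16x16 macroblocks.
qcif_map = checkerboard_map(11, 9)
```

Because neighbors of a lost macroblock then belong to the other slice group, a decoder can conceal losses by interpolating from surrounding correctly received macroblocks.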

E. Encoding and Decoding Process for Macroblocks

  • All luma and chroma samples of a macroblock are either spatially or temporally predicted, and the resulting prediction residual is encoded using transform coding.
  • For transform coding purposes, each color component of the prediction residual signal is subdivided into smaller 4×4 blocks.
  • Each block is transformed using an integer transform, and the transform coefficients are quantized and encoded using entropy coding methods.
  • The input video signal is split into macroblocks, the association of macroblocks to slice groups and slices is selected, and then each macroblock of each slice is processed as shown.
  • Efficient parallel processing of macroblocks is possible when there are multiple slices in the picture.

F. Adaptive Frame/Field Coding Operation

  • In interlaced frames with regions of moving objects or camera motion, two adjacent rows tend to show a reduced degree of statistical dependency when compared to progressive frames.
  • Therefore, the frame/field encoding decision can also be made independently for each vertical pair of macroblocks (a 16×32 luma region) in a frame.
  • Prediction mode 0 (vertical prediction), mode 1 (horizontal prediction), and mode 2 (DC prediction) are specified similar to the modes in Intra_4×4 prediction, except that instead of 4 neighbors on each side to predict a 4×4 block, 16 neighbors on each side are used to predict a 16×16 block.
  • That is, more than one prior coded picture can be used as reference for motion-compensated prediction.

I. Transform, Scaling, and Quantization

  • Similar to previous video coding standards, H.264/AVC utilizes transform coding of the prediction residual.
  • The smaller transform requires fewer computations and a smaller processing word length.
  • Since the transform involves only adds and shifts, it is also specified such that mismatch between encoder and decoder is avoided (this has been a problem with earlier 8×8 DCT standards).
  • A quantization parameter is used for determining the quantization step size.
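The quantization parameter controls the step size logarithmically: the step size doubles for every increase of 6 in QP. The base values below are the commonly tabulated step sizes for QP 0 to 5; treat the sketch as illustrative rather than a normative table.

```python
def quant_step(qp: int) -> float:
    """Approximate H.264/AVC quantization step size for QP in 0..51:
    the step size doubles with every increment of 6 in QP."""
    base = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]  # step sizes for QP 0..5
    return base[qp % 6] * (2 ** (qp // 6))

# quant_step(28) is 16 times quant_step(4): four doublings of QP by 6.
```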

L. Hypothetical Reference Decoder

  • One of the key benefits provided by a standard is the assurance that all the decoders compliant with the standard will be able to decode a compliant compressed video.
  • (Figure caption: performance of the deblocking filter for highly compressed pictures, (a) without and (b) with the deblocking filter.)
  • This is achieved by specifying input and output buffer models and developing an implementation-independent model of a receiver, the hypothetical reference decoder (HRD).
  • In H.264/AVC, the HRD specifies the operation of two buffers: 1) the coded picture buffer (CPB) and 2) the decoded picture buffer (DPB).
  • The CPB models the arrival and removal time of the coded bits.
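The CPB behaves like a leaky bucket: bits arrive at the channel rate and each picture's bits are removed at its decoding time. The toy simulation below is our own simplification, assuming constant-rate arrival and instantaneous removal once per picture.

```python
def cpb_trace(bit_rate: float, cpb_size: float, frame_bits, frame_rate: float):
    """Return CPB fullness (in bits) after each picture is removed.
    A negative value signals underflow (the decoder would have to wait
    for bits); arrival is capped at the buffer size, modeling overflow."""
    fullness, trace = 0.0, []
    for bits in frame_bits:
        fullness = min(fullness + bit_rate / frame_rate, cpb_size)  # arrival
        fullness -= bits                                            # removal
        trace.append(fullness)
    return trace
```

An encoder's rate control must keep such a trace within bounds for the declared level, which is what makes a compliant stream decodable with finite buffering.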

A. Profiles and Levels

  • These conformance points are designed to facilitate interoperability between various applications of the standard that have similar functional requirements.
  • The Baseline profile supports all features in H.264/AVC except two feature sets: Set 1 (B slices, weighted prediction, CABAC, field coding, and macroblock-adaptive frame/field coding) and Set 2 (SP/SI slices and slice data partitioning).
  • The first set of additional features is supported by the Main profile.

B. Areas for the Profiles of the New Standard to be Used

  • The increased compression efficiency of H.264/AVC can enhance existing applications or enable new ones.
  • In TML-7/8, 1/8-sample-accurate motion compensation was introduced, but it was later dropped for complexity reasons in JM-5.


560 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 13, NO. 7, JULY 2003
Overview of the H.264/AVC Video Coding Standard
Thomas Wiegand, Gary J. Sullivan, Senior Member, IEEE, Gisle Bjøntegaard, and Ajay Luthra, Senior Member, IEEE
Abstract—H.264/AVC is the newest video coding standard of the
ITU-T Video Coding Experts Group and the ISO/IEC Moving
Picture Experts Group. The main goals of the H.264/AVC stan-
dardization effort have been enhanced compression performance
and provision of a “network-friendly” video representation
addressing “conversational” (video telephony) and “noncon-
versational” (storage, broadcast, or streaming) applications.
H.264/AVC has achieved a significant improvement in rate-distor-
tion efficiency relative to existing standards. This article provides
an overview of the technical features of H.264/AVC, describes
profiles and applications for the standard, and outlines the history
of the standardization process.
Index Terms—AVC, H.263, H.264, JVT, MPEG-2, MPEG-4,
standards, video.
I. INTRODUCTION
H.264/AVC is the newest international video coding stan-
dard [1]. By the time of this publication, it is expected to
have been approved by ITU-T as Recommendation H.264 and
by ISO/IEC as International Standard 14496-10 (MPEG-4 part
10) Advanced Video Coding (AVC).
The MPEG-2 video coding standard (also known as ITU-T
H.262) [2], which was developed about ten years ago primarily
as an extension of prior MPEG-1 video capability with support
of interlaced video coding, was an enabling technology for dig-
ital television systems worldwide. It is widely used for the trans-
mission of standard definition (SD) and high definition (HD)
TV signals over satellite, cable, and terrestrial emission and the
storage of high-quality SD video signals onto DVDs.
However, an increasing number of services and growing
popularity of high definition TV are creating greater needs
for higher coding efficiency. Moreover, other transmission
media such as Cable Modem, xDSL, or UMTS offer much
lower data rates than broadcast channels, and enhanced coding
efficiency can enable the transmission of more video channels
or higher quality video representations within existing digital
transmission capacities.
Video coding for telecommunication applications has
evolved through the development of the ITU-T H.261, H.262
(MPEG-2), and H.263 video coding standards (and later
enhancements of H.263 known as H.263+ and H.263++),
Manuscript received April 15, 2002; revised May 10, 2003.
T. Wiegand is with the Fraunhofer-Institute for Telecommunications,
Heinrich-Hertz-Institute, Einsteinufer 37, 10587 Berlin, Germany (e-mail:
wiegand@hhi.de).
G. J. Sullivan is with the Microsoft Corporation, Redmond, WA 98052 USA
(e-mail: garysull@microsoft.com).
G. Bjøntegaard is with the Tandberg, N-1324 Lysaker, Norway (e-mail:
gbj@tandberg.no)
A. Luthra is with the Broadband Communications Sector, Motorola, Inc., San
Diego, CA 92121 USA. (e-mail: aluthra@motorola.com)
Digital Object Identifier 10.1109/TCSVT.2003.815165
Fig. 1. Scope of video coding standardization.
and has diversified from ISDN and T1/E1 service to embrace
PSTN, mobile wireless networks, and LAN/Internet network
delivery. Throughout this evolution, continued efforts have
been made to maximize coding efficiency while dealing with
the diversification of network types and their characteristic
formatting and loss/error robustness requirements.
Recently the MPEG-4 Visual (MPEG-4 part 2) standard [5]
has also begun to emerge in use in some application domains of
the prior coding standards. It has provided video shape coding
capability, and has similarly worked toward broadening the
range of environments for digital video use.
In early 1998, the Video Coding Experts Group (VCEG)
ITU-T SG16 Q.6 issued a call for proposals on a project called
H.26L, with the target to double the coding efficiency (which
means halving the bit rate necessary for a given level of fidelity)
in comparison to any other existing video coding standards for
a broad variety of applications. The first draft design for that
new standard was adopted in October of 1999. In December of
2001, VCEG and the Moving Picture Experts Group (MPEG)
ISO/IEC JTC 1/SC 29/WG 11 formed a Joint Video Team
(JVT), with the charter to finalize the draft new video coding
standard for formal approval submission as H.264/AVC [1] in
March 2003.
The scope of the standardization is illustrated in Fig. 1, which
shows the typical video coding/decoding chain (excluding the
transport or storage of the video signal). As has been the case
for all ITU-T and ISO/IEC video coding standards, only the
central decoder is standardized, by imposing restrictions on the
bitstream and syntax, and defining the decoding process of the
syntax elements such that every decoder conforming to the stan-
dard will produce similar output when given an encoded bit-
stream that conforms to the constraints of the standard. This lim-
itation of the scope of the standard permits maximal freedom
to optimize implementations in a manner appropriate to spe-
cific applications (balancing compression quality, implementa-
tion cost, time to market, etc.). However, it provides no guaran-
tees of end-to-end reproduction quality, as it allows even crude
encoding techniques to be considered conforming.
This paper is organized as follows. Section II provides a high-
level overview of H.264/AVC applications and highlights some
key technical features of the design that enable improved oper-
ation for this broad variety of applications. Section III explains
the network abstraction layer (NAL) and the overall structure
1051-8215/03$17.00 © 2003 IEEE

Fig. 2. Structure of H.264/AVC video encoder.
of H.264/AVC coded video data. The video coding layer (VCL)
is described in Section IV. Section V explains the profiles sup-
ported by H.264/AVC and some potential application areas of
the standard.
II. APPLICATIONS AND DESIGN FEATURE HIGHLIGHTS
The new standard is designed for technical solutions including
at least the following application areas:
• Broadcast over cable, satellite, cable modem, DSL, terrestrial, etc.
• Interactive or serial storage on optical and magnetic devices, DVD, etc.
• Conversational services over ISDN, Ethernet, LAN, DSL, wireless and mobile networks, modems, etc., or mixtures of these.
• Video-on-demand or multimedia streaming services over ISDN, cable modem, DSL, LAN, wireless networks, etc.
• Multimedia messaging services (MMS) over ISDN, DSL, Ethernet, LAN, wireless and mobile networks, etc.
Moreover, new applications may be deployed over existing and
future networks. This raises the question about how to handle
this variety of applications and networks.
To address this need for flexibility and customizability, the
H.264/AVC design covers a VCL, which is designed to effi-
ciently represent the video content, and a NAL, which formats
the VCL representation of the video and provides header infor-
mation in a manner appropriate for conveyance by a variety of
transport layers or storage media (see Fig. 2).
Relative to prior video coding methods, as exemplified by
MPEG-2 video, some highlighted features of the design that en-
able enhanced coding efficiency include the following enhance-
ments of the ability to predict the values of the content of a pic-
ture to be encoded.
Variable block-size motion compensation with small
block sizes: This standard supports more flexibility in the
selection of motion compensation block sizes and shapes
than any previous standard, with a minimum luma motion
compensation block size as small as 4×4.
Quarter-sample-accurate motion compensation: Most
prior standards enable half-sample motion vector accuracy
at most. The new design improves upon this by adding
quarter-sample motion vector accuracy, as first found in
an advanced profile of the MPEG-4 Visual (part 2) stan-
dard, but further reduces the complexity of the interpola-
tion processing compared to the prior design.
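In H.264/AVC, luma half-sample positions are produced with a 6-tap filter with coefficients (1, −5, 20, 20, −5, 1), and quarter-sample positions by averaging neighboring integer- and half-sample values. A one-dimensional sketch (function names are ours; boundary handling is omitted):

```python
def half_sample(row, i):
    """Half-sample value between row[i] and row[i+1] using the 6-tap
    filter (1, -5, 20, 20, -5, 1), with rounding and clipping to the
    8-bit range. Assumes 2 <= i <= len(row) - 4 so all taps are in bounds."""
    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(t * row[i - 2 + k] for k, t in enumerate(taps))
    return max(0, min(255, (acc + 16) >> 5))   # divide by 32 with rounding

def quarter_sample(a, b):
    """Quarter-sample positions average two neighboring values
    with upward rounding."""
    return (a + b + 1) >> 1
```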
Motion vectors over picture boundaries: While motion
vectors in MPEG-2 and its predecessors were required to
point only to areas within the previously-decoded refer-
ence picture, the picture boundary extrapolation technique
first found as an optional feature in H.263 is included in
H.264/AVC.
Multiple reference picture motion compensation: Pre-
dictively coded pictures (called “P” pictures) in MPEG-2
and its predecessors used only one previous picture to pre-
dict the values in an incoming picture. The new design ex-
tends upon the enhanced reference picture selection tech-
nique found in H.263++ to enable efficient coding by al-
lowing an encoder to select, for motion compensation pur-
poses, among a larger number of pictures that have been
decoded and stored in the decoder. The same extension
of referencing capability is also applied to motion-com-
pensated bi-prediction, which is restricted in MPEG-2 to
using two specific pictures only (one of these being the
previous intra (I) or P picture in display order and the other
being the next I or P picture in display order).
Decoupling of referencing order from display order:
In prior standards, there was a strict dependency between
the ordering of pictures for motion compensation refer-
encing purposes and the ordering of pictures for display
purposes. In H.264/AVC, these restrictions are largely re-
moved, allowing the encoder to choose the ordering of
pictures for referencing and display purposes with a high
degree of flexibility constrained only by a total memory
capacity bound imposed to ensure decoding ability. Re-
moval of the restriction also enables removing the extra
delay previously associated with bi-predictive coding.
Decoupling of picture representation methods from
picture referencing capability: In prior standards,
pictures encoded using some encoding methods (namely
bi-predictively-encoded pictures) could not be used as
references for prediction of other pictures in the video
sequence. By removing this restriction, the new standard
provides the encoder more flexibility and, in many cases,
an ability to use a picture for referencing that is a closer
approximation to the picture being encoded.
Weighted prediction: A new innovation in H.264/AVC
allows the motion-compensated prediction signal to be
weighted and offset by amounts specified by the encoder.
This can dramatically improve coding efficiency for
scenes containing fades, and can be used flexibly for
other purposes as well.
Improved “skipped” and “direct” motion inference: In
prior standards, a “skipped” area of a predictively-coded
picture could not represent motion in the scene content. This had
a detrimental effect when coding video containing global
motion, so the new H.264/AVC design instead infers mo-
tion in “skipped” areas. For bi-predictively coded areas
(called B slices), H.264/AVC also includes an enhanced
motion inference method known as “direct” motion com-
pensation, which improves further on prior “direct” pre-
diction designs found in H.263+ and MPEG-4 Visual.

Directional spatial prediction for intra coding: A new
technique of extrapolating the edges of the previously-de-
coded parts of the current picture is applied in regions of
pictures that are coded as intra (i.e., coded without ref-
erence to the content of some other picture). This im-
proves the quality of the prediction signal, and also allows
prediction from neighboring areas that were not coded
using intra coding (something not enabled when using
the transform-domain prediction method found in H.263+
and MPEG-4 Visual).
In-the-loop deblocking filtering: Block-based video
coding produces artifacts known as blocking artifacts.
These can originate from both the prediction and residual
difference coding stages of the decoding process. Appli-
cation of an adaptive deblocking filter is a well-known
method of improving the resulting video quality, and when
designed well, this can improve both objective and sub-
jective video quality. Building further on a concept from
an optional feature of H.263+, the deblocking filter in
the H.264/AVC design is brought within the motion-com-
pensated prediction loop, so that this improvement in
quality can be used in inter-picture prediction to improve
the ability to predict other pictures as well.
In addition to improved prediction methods, other parts of
the design were also enhanced for improved coding efficiency,
including the following.
Small block-size transform: All major prior video
coding standards used a transform block size of 8×8,
while the new H.264/AVC design is based primarily on
a 4×4 transform. This allows the encoder to represent
signals in a more locally-adaptive fashion, which reduces
artifacts known colloquially as “ringing”. (The smaller
block size is also justified partly by the advances in the
ability to better predict the content of the video using
the techniques noted above, and by the need to provide
transform regions with boundaries that correspond to
those of the smallest prediction regions.)
Hierarchical block transform: While in most cases,
using the small 4×4 transform block size is perceptually
beneficial, there are some signals that contain sufficient
correlation to call for some method of using a repre-
sentation with longer basis functions. The H.264/AVC
standard enables this in two ways: 1) by using a hierar-
chical transform to extend the effective block size used
for low-frequency chroma information to an 8×8 array
and 2) by allowing the encoder to select a special coding
type for intra coding, enabling extension of the length of
the luma transform for low-frequency information to a
16×16 block size in a manner very similar to that applied
to the chroma.
Short word-length transform: All prior standard de-
signs have effectively required encoders and decoders to
use more complex processing for transform computation.
While previous designs have generally required 32-bit
processing, the H.264/AVC design requires only 16-bit
arithmetic.
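The 4×4 core transform uses only small integer coefficients, which is what makes 16-bit arithmetic sufficient. A sketch using plain matrix multiplication for clarity (in practice the multiplications reduce to adds and shifts, and the normalization that the standard folds into quantization is omitted here):

```python
# Forward core transform matrix Cf of the H.264/AVC 4x4 transform;
# every entry is a small integer.
CF = [[1,  1,  1,  1],
      [2,  1, -1, -2],
      [1, -1, -1,  1],
      [1, -2,  2, -1]]

def matmul(a, b):
    """4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_transform(block):
    """Y = Cf * X * Cf^T for a 4x4 residual block (scaling omitted)."""
    cf_t = [list(col) for col in zip(*CF)]
    return matmul(matmul(CF, block), cf_t)

# A flat residual block concentrates all energy in the DC coefficient.
coeffs = forward_transform([[1, 1, 1, 1]] * 4)
```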
Exact-match inverse transform: In previous video
coding standards, the transform used for representing
the video was generally specified only within an error
tolerance bound, due to the impracticality of obtaining an
exact match to the ideal specified inverse transform. As
a result, each decoder design would produce slightly dif-
ferent decoded video, causing a “drift” between encoder
and decoder representation of the video and reducing
effective video quality. Building on a path laid out as an
optional feature in the H.263++ effort, H.264/AVC is
the first standard to achieve exact equality of decoded
video content from all decoders.
Arithmetic entropy coding: An advanced entropy
coding method known as arithmetic coding is included
in H.264/AVC. While arithmetic coding was previously
found as an optional feature of H.263, a more effective
use of this technique is found in H.264/AVC to create a
very powerful entropy coding method known as CABAC
(context-adaptive binary arithmetic coding).
Context-adaptive entropy coding: The two entropy
coding methods applied in H.264/AVC, termed CAVLC
(context-adaptive variable-length coding) and CABAC,
both use context-based adaptivity to improve performance
relative to prior standard designs.
Robustness to data errors/losses and flexibility for operation
over a variety of network environments is enabled by a number
of design aspects new to the H.264/AVC standard, including the
following highlighted features.
Parameter set structure: The parameter set design pro-
vides for robust and efficient conveyance of header informa-
tion. As the loss of a few key bits of information (such as
sequence header or picture header information) could have
a severe negative impact on the decoding process when
using prior standards, this key information was separated
for handling in a more flexible and specialized manner in
the H.264/AVC design.
NAL unit syntax structure: Each syntax structure in
H.264/AVC is placed into a logical data packet called a
NAL unit. Rather than forcing a specific bitstream inter-
face to the system as in prior video coding standards, the
NAL unit syntax structure allows greater customization
of the method of carrying the video content in a manner
appropriate for each specific network.
Flexible slice size: Unlike the rigid slice structure found in
MPEG-2 (which reduces coding efficiency by increasing
the quantity of header data and decreasing the effective-
ness of prediction), slice sizes in H.264/AVC are highly
flexible, as was the case earlier in MPEG-1.
Flexible macroblock ordering (FMO): A new ability to
partition the picture into regions called slice groups has
been developed, with each slice becoming an indepen-
dently-decodable subset of a slice group. When used ef-
fectively, flexible macroblock ordering can significantly
enhance robustness to data losses by managing the spatial
relationship between the regions that are coded in each
slice. (FMO can also be used for a variety of other pur-
poses as well.)
Arbitrary slice ordering (ASO): Since each slice of a
coded picture can be (approximately) decoded indepen-
dently of the other slices of the picture, the H.264/AVC

design enables sending and receiving the slices of the
picture in any order relative to each other. This capability,
first found in an optional part of H.263+, can improve
end-to-end delay in real-time applications, particularly
when used on networks having out-of-order delivery
behavior (e.g., internet protocol networks).
Redundant pictures: In order to enhance robustness to
data loss, the H.264/AVC design contains a new ability
to allow an encoder to send redundant representations of
regions of pictures, enabling a (typically somewhat de-
graded) representation of regions of pictures for which the
primary representation has been lost during data transmis-
sion.
Data partitioning: Since some coded information for rep-
resentation of each region (e.g., motion vectors and other
prediction information) is more important or more valu-
able than other information for purposes of representing
the video content, H.264/AVC allows the syntax of each
slice to be separated into up to three different partitions for
transmission, depending on a categorization of syntax ele-
ments. This part of the design builds further on a path taken
in MPEG-4 Visual and in an optional part of
.
Here, the design is simplified by having a single syntax
with partitioning of that same syntax controlled by a spec-
ified categorization of syntax elements.
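As a rough sketch of this idea, slice data can be split into three partitions: partition A carrying the slice header and prediction information (such as macroblock types and motion vectors), partition B carrying intra-coded residual data, and partition C carrying inter-coded residual data. The element names below are simplified placeholders; the normative categorization of syntax elements is specified in the standard:

```python
# Illustrative grouping of slice syntax elements into the three data
# partitions (A: headers/prediction info, B: intra residual,
# C: inter residual). The element names are simplified placeholders,
# not the normative syntax element names.

PARTITION_OF = {
    "slice_header":       "A",
    "mb_type":            "A",
    "motion_vectors":     "A",
    "intra_pred_modes":   "A",
    "intra_coefficients": "B",
    "inter_coefficients": "C",
}

def partition_slice(elements):
    """Split a list of (name, payload) syntax elements into partitions."""
    parts = {"A": [], "B": [], "C": []}
    for name, payload in elements:
        parts[PARTITION_OF[name]].append(payload)
    return parts
```

The practical benefit is unequal error protection: partition A, whose loss makes the rest of the slice undecodable, can be sent with stronger protection than partitions B and C.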
SP/SI synchronization/switching pictures: The
H.264/AVC design includes a new feature consisting
of picture types that allow exact synchronization of the
decoding process of some decoders with an ongoing video
stream produced by other decoders without penalizing
all decoders with the loss of efficiency resulting from
sending an I picture. This can enable switching a decoder
between representations of the video content that use
different data rates, recovery from data losses or errors,
as well as enabling trick modes such as fast-forward,
fast-reverse, etc.
In Sections III and IV, a more detailed description of the key
features is given.
III. NAL
The NAL is designed in order to provide “network friendli-
ness” to enable simple and effective customization of the use of
the VCL for a broad variety of systems.
The NAL facilitates the ability to map H.264/AVC VCL data
to transport layers such as:
• RTP/IP for any kind of real-time wire-line and wireless
Internet services (conversational and streaming);
• File formats, e.g., ISO MP4 for storage and MMS;
• H.32X for wireline and wireless conversational services;
• MPEG-2 systems for broadcasting services, etc.
The full degree of customization of the video content to fit the
needs of each particular application is outside the scope of the
H.264/AVC standardization effort, but the design of the NAL
anticipates a variety of such mappings. Some key concepts of
the NAL are NAL units, byte stream, and packet format uses of
NAL units, parameter sets, and access units. A short descrip-
tion of these concepts is given below whereas a more detailed
description including error resilience aspects is provided in [6]
and [7].
A. NAL Units
The coded video data is organized into NAL units, each of
which is effectively a packet that contains an integer number
of bytes. The first byte of each NAL unit is a header byte that
contains an indication of the type of data in the NAL unit, and
the remaining bytes contain payload data of the type indicated
by the header.
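Concretely, the header byte consists of a 1-bit forbidden_zero_bit (required to be zero), a 2-bit nal_ref_idc indicating the importance of the NAL unit for reference purposes, and a 5-bit nal_unit_type. A minimal parser for this byte might be sketched as:

```python
def parse_nal_header(first_byte):
    """Decode the three fields of the one-byte NAL unit header."""
    return {
        "forbidden_zero_bit": (first_byte >> 7) & 0x01,  # must be 0
        "nal_ref_idc":        (first_byte >> 5) & 0x03,  # reference priority
        "nal_unit_type":      first_byte & 0x1F,         # payload type
    }

# 0x67 = 0110_0111b: nal_ref_idc 3, nal_unit_type 7
# (a sequence parameter set)
hdr = parse_nal_header(0x67)
```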
The payload data in the NAL unit is interleaved as necessary
with emulation prevention bytes, which are bytes inserted with
a specific value to prevent a particular pattern of data called a
start code prefix from being accidentally generated inside the
payload.
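The insertion rule can be sketched as follows: whenever two consecutive zero bytes would be followed by a byte of value 0x00 through 0x03, the encoder inserts an emulation prevention byte of value 0x03, so that the three-byte start code prefix pattern (and the related reserved patterns) cannot occur inside the payload:

```python
def insert_emulation_prevention(payload: bytes) -> bytes:
    """Insert 0x03 after any 0x00 0x00 pair followed by 0x00..0x03."""
    out = bytearray()
    zeros = 0
    for b in payload:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)  # emulation prevention byte
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(out)
```

A decoder performs the inverse operation, discarding any 0x03 byte that follows two zero bytes before interpreting the payload.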
The NAL unit structure definition specifies a generic format
for use in both packet-oriented and bitstream-oriented transport
systems, and a series of NAL units generated by an encoder is
referred to as a NAL unit stream.
B. NAL Units in Byte-Stream Format Use
Some systems (e.g., H.320 and MPEG-2/H.222.0 systems)
require delivery of the entire or partial NAL unit stream as an or-
dered stream of bytes or bits within which the locations of NAL
unit boundaries need to be identifiable from patterns within the
coded data itself.
For use in such systems, the H.264/AVC specification defines
a byte stream format. In the byte stream format, each NAL unit
is prefixed by a specific pattern of three bytes called a start code
prefix. The boundaries of the NAL unit can then be identified by
searching the coded data for the unique start code prefix pattern.
The use of emulation prevention bytes guarantees that start code
prefixes are unique identifiers of the start of a new NAL unit.
A small amount of additional data (one byte per video pic-
ture) is also added to allow decoders that operate in systems that
provide streams of bits without alignment to byte boundaries to
recover the necessary alignment from the data in the stream.
Additional data can also be inserted in the byte stream format
that allows expansion of the amount of data to be sent and can
aid in achieving more rapid byte alignment recovery, if desired.
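As a sketch of how a decoder might locate NAL unit boundaries in the byte stream format (ignoring the extra byte-alignment data mentioned above, and without removing any trailing zero padding), one can simply scan for the three-byte start code prefix:

```python
def split_byte_stream(stream: bytes):
    """Split a byte-stream-format sequence into NAL units by scanning
    for the 0x00 0x00 0x01 start code prefix. Leading zero bytes
    before a prefix are skipped; handling of trailing zero padding
    after a NAL unit is omitted in this simplified sketch."""
    prefix = b"\x00\x00\x01"
    units = []
    pos = stream.find(prefix)
    while pos != -1:
        start = pos + len(prefix)
        nxt = stream.find(prefix, start)
        end = len(stream) if nxt == -1 else nxt
        units.append(stream[start:end])
        pos = nxt
    return units
```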
C. NAL Units in Packet-Transport System Use
In other systems (e.g., internet protocol/RTP systems), the
coded data is carried in packets that are framed by the system
transport protocol, and identification of the boundaries of NAL
units within the packets can be established without use of start
code prefix patterns. In such systems, the inclusion of start code
prefixes in the data would be a waste of data carrying capacity,
so instead the NAL units can be carried in data packets without
start code prefixes.
D. VCL and Non-VCL NAL Units
NAL units are classified into VCL and non-VCL NAL units.
The VCL NAL units contain the data that represents the values
of the samples in the video pictures, and the non-VCL NAL
units contain any associated additional information such as pa-
rameter sets (important header data that can apply to a large
number of VCL NAL units) and supplemental enhancement in-
formation (timing information and other supplemental data that
may enhance usability of the decoded video signal but are not
necessary for decoding the values of the samples in the video
pictures).

564 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 13, NO. 7, JULY 2003
Fig. 3. Parameter set use with reliable “out-of-band” parameter set exchange.
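The distinction is carried by the nal_unit_type field of the NAL unit header. A simplified mapping over some of the defined type values (types 1 through 5 are the VCL NAL units) can be sketched as:

```python
# Simplified mapping of some nal_unit_type values to their meaning;
# types 1-5 are the VCL NAL units, the others shown are non-VCL.
NAL_TYPES = {
    1:  "coded slice (non-IDR)",
    2:  "coded slice data partition A",
    3:  "coded slice data partition B",
    4:  "coded slice data partition C",
    5:  "coded slice (IDR)",
    6:  "supplemental enhancement information (SEI)",
    7:  "sequence parameter set",
    8:  "picture parameter set",
    9:  "access unit delimiter",
    10: "end of sequence",
    11: "end of stream",
}

def is_vcl(nal_unit_type: int) -> bool:
    """VCL NAL units carry coded sample data (types 1..5)."""
    return 1 <= nal_unit_type <= 5
```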
E. Parameter Sets
A parameter set contains information that is expected to
change rarely and that applies to the decoding of a large
number of VCL NAL units. There are two types of parameter
sets:
• sequence parameter sets, which apply to a series of con-
secutive coded video pictures called a coded video se-
quence;
• picture parameter sets, which apply to the decoding of one
or more individual pictures within a coded video sequence.
The sequence and picture parameter-set mechanism decouples
the transmission of infrequently changing information from the
transmission of coded representations of the values of the sam-
ples in the video pictures. Each VCL NAL unit contains an iden-
tifier that refers to the content of the relevant picture parameter
set and each picture parameter set contains an identifier that
refers to the content of the relevant sequence parameter set. In
this manner, a small amount of data (the identifier) can be used
to refer to a larger amount of information (the parameter set)
without repeating that information within each VCL NAL unit.
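The indirection can be sketched as a pair of small tables in the decoder, keyed by the identifiers. The field names below are illustrative simplifications of the syntax element names in the standard:

```python
# Sketch of parameter-set indirection: a slice refers to a picture
# parameter set by id, which in turn refers to a sequence parameter
# set by id. Field names are simplified for illustration.
sps_table = {0: {"profile": "Baseline", "level": 3.0}}
pps_table = {0: {"seq_parameter_set_id": 0, "entropy_mode": "CAVLC"}}

def resolve_parameter_sets(pic_parameter_set_id):
    """Follow the slice -> PPS -> SPS chain of identifiers."""
    pps = pps_table[pic_parameter_set_id]
    sps = sps_table[pps["seq_parameter_set_id"]]
    return pps, sps

pps, sps = resolve_parameter_sets(0)
```

Each VCL NAL unit thus needs to carry only the small picture parameter set identifier rather than repeating the full header information.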
Sequence and picture parameter sets can be sent well ahead
of the VCL NAL units that they apply to, and can be repeated to
provide robustness against data loss. In some applications, pa-
rameter sets may be sent within the channel that carries the VCL
NAL units (termed “in-band” transmission). In other applica-
tions (see Fig. 3), it can be advantageous to convey the param-
eter sets “out-of-band” using a more reliable transport mecha-
nism than the video channel itself.
F. Access Units
A set of NAL units in a specified form is referred to as an
access unit. The decoding of each access unit results in one de-
coded picture. The format of an access unit is shown in Fig. 4.
Each access unit contains a set of VCL NAL units that to-
gether compose a primary coded picture. It may also be prefixed
with an access unit delimiter to aid in locating the start of the
access unit. Some supplemental enhancement information con-
taining data such as picture timing information may also precede
the primary coded picture.
Fig. 4. Structure of an access unit.
The primary coded picture consists of a set of VCL NAL units
consisting of slices or slice data partitions that represent the
samples of the video picture.
Following the primary coded picture may be some additional
VCL NAL units that contain redundant representations of areas
of the same video picture. These are referred to as redundant
coded pictures, and are available for use by a decoder in recov-
ering from loss or corruption of the data in the primary coded
pictures. Decoders are not required to decode redundant coded
pictures if they are present.
Finally, if the coded picture is the last picture of a coded
video sequence (a sequence of pictures that is independently
decodable and uses only one sequence parameter set), an end
of sequence NAL unit may be present to indicate the end of the
sequence; and if the coded picture is the last coded picture in
the entire NAL unit stream, an end of stream NAL unit may be
present to indicate that the stream is ending.
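The ordering constraints described above (and shown in Fig. 4) can be summarized in a simple sketch that checks whether a sequence of NAL unit kinds forms a plausibly ordered access unit: an optional delimiter, optional SEI, the primary coded picture, optional redundant coded pictures, then optional end-of-sequence and end-of-stream markers. This is an informal check over single-letter codes, not the normative access unit definition:

```python
import re

# Illustrative check of access unit ordering using single-letter
# codes for NAL unit kinds:
#   D = access unit delimiter, S = SEI, P = primary-picture slice,
#   R = redundant-picture slice, Q = end of sequence, E = end of stream
AU_PATTERN = re.compile(r"^D?S*P+R*Q?E?$")

def is_well_ordered_access_unit(kinds):
    """Return True if the sequence of kind codes matches the sketch."""
    return bool(AU_PATTERN.match("".join(kinds)))
```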
G. Coded Video Sequences
A coded video sequence consists of a series of access units
that are sequential in the NAL unit stream and use only one se-
quence parameter set. Each coded video sequence can be de-
coded independently of any other coded video sequence, given
the necessary parameter set information, which may be con-
veyed “in-band” or “out-of-band”. At the beginning of a coded
video sequence is an instantaneous decoding refresh (IDR) ac-
cess unit. An IDR access unit contains an intra picture—a coded
