Overview of the H.264/AVC video coding standard
Summary (5 min read)
Introduction
- The new standard is designed for technical solutions including at least the following application areas: broadcast over cable, satellite, cable modem, DSL, terrestrial, etc.; interactive or serial storage on optical and magnetic devices, DVD, etc.; conversational services over ISDN, Ethernet, LAN, DSL, wireless and mobile networks, modems, etc.; or mixtures of these.
- Multiple reference picture motion compensation: Predictively coded pictures (called “P” pictures) in MPEG-2 and its predecessors used only one previous picture to predict the values in an incoming picture.
- In addition to improved prediction methods, other parts of the design were also enhanced for improved coding efficiency, including the following.
- Robustness to data errors/losses and flexibility for operation over a variety of network environments is enabled by a number of design aspects new to the H.264/AVC standard, including the following highlighted features.
A. NAL Units
- The coded video data is organized into NAL units, each of which is effectively a packet that contains an integer number of bytes.
- The first byte of each NAL unit is a header byte that contains an indication of the type of data in the NAL unit, and the remaining bytes contain payload data of the type indicated by the header.
- The payload data in the NAL unit is interleaved as necessary with emulation prevention bytes, which are bytes inserted with a specific value to prevent a particular pattern of data called a start code prefix from being accidentally generated inside the payload.
- The NAL unit structure definition specifies a generic format for use in both packet-oriented and bitstream-oriented transport systems, and a series of NAL units generated by an encoder is referred to as a NAL unit stream.
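The header byte and the emulation prevention mechanism can be sketched in a few lines of Python. This is a minimal illustration, not a conforming implementation: the header layout (1-bit forbidden_zero_bit, 2-bit nal_ref_idc, 5-bit nal_unit_type) and the escape value 0x03 are taken from the H.264/AVC specification rather than from the summary above, the function names are hypothetical, and corner cases at the very end of a payload are not handled.

```python
def parse_nal_header(header_byte: int) -> dict:
    """Split the one-byte NAL unit header into its three fields."""
    return {
        "forbidden_zero_bit": (header_byte >> 7) & 0x01,
        "nal_ref_idc": (header_byte >> 5) & 0x03,
        "nal_unit_type": header_byte & 0x1F,
    }


def add_emulation_prevention(payload: bytes) -> bytes:
    """Insert 0x03 after two zero bytes whenever the next byte is <= 0x03."""
    out = bytearray()
    zeros = 0  # number of consecutive zero bytes just written
    for b in payload:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)  # emulation prevention byte
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(out)


def remove_emulation_prevention(data: bytes) -> bytes:
    """Drop every 0x03 that directly follows two zero bytes."""
    out = bytearray()
    zeros = 0
    for b in data:
        if zeros >= 2 and b == 0x03:
            zeros = 0
            continue  # skip the inserted byte
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(out)


if __name__ == "__main__":
    print(parse_nal_header(0x67))
    print(add_emulation_prevention(b"\x00\x00\x01").hex())  # 00000301
```

After escaping, the three-byte pattern that would look like a start code prefix no longer occurs inside the payload, and the decoder recovers the original bytes by removing the inserted values.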
B. NAL Units in Byte-Stream Format Use
- Some systems (e.g., H.320 and MPEG-2/H.222.0 systems) require delivery of the entire or partial NAL unit stream as an ordered stream of bytes or bits within which the locations of NAL unit boundaries need to be identifiable from patterns within the coded data itself.
- For use in such systems, the H.264/AVC specification defines a byte stream format.
- In the byte stream format, each NAL unit is prefixed by a specific pattern of three bytes called a start code prefix.
- The use of emulation prevention bytes guarantees that start code prefixes are unique identifiers of the start of a new NAL unit.
- A small amount of additional data (one byte per video picture) is also added to allow decoders that operate in systems that provide streams of bits without alignment to byte boundaries to recover the necessary alignment from the data in the stream.
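As a rough sketch of how a receiver might locate NAL unit boundaries in the byte stream format, the Python function below scans for the three-byte start code prefix. The prefix value 0x000001 comes from the H.264/AVC specification, not from the text above; the function name is hypothetical, and zero bytes found directly before a prefix are treated as byte-stream padding.

```python
START_CODE_PREFIX = b"\x00\x00\x01"


def split_byte_stream(stream: bytes) -> list:
    """Return the NAL units found between start code prefixes."""
    nal_units = []
    pos = stream.find(START_CODE_PREFIX)
    while pos != -1:
        start = pos + len(START_CODE_PREFIX)
        nxt = stream.find(START_CODE_PREFIX, start)
        end = len(stream) if nxt == -1 else nxt
        # Zero bytes ahead of the next prefix are padding, not NAL unit data.
        nal = stream[start:end].rstrip(b"\x00")
        if nal:
            nal_units.append(nal)
        pos = nxt
    return nal_units


if __name__ == "__main__":
    stream = b"\x00\x00\x00\x01\x67\xaa\x00\x00\x01\x68\xbb"
    print([nal.hex() for nal in split_byte_stream(stream)])  # ['67aa', '68bb']
```

This simple search is only reliable because emulation prevention keeps the prefix pattern from appearing inside a payload.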
C. NAL Units in Packet-Transport System Use
- In other systems (e.g., internet protocol/RTP systems), the coded data is carried in packets that are framed by the system transport protocol, and identification of the boundaries of NAL units within the packets can be established without use of start code prefix patterns.
- In such systems, the inclusion of start code prefixes in the data would be a waste of data carrying capacity, so instead the NAL units can be carried in data packets without start code prefixes.
D. VCL and Non-VCL NAL Units
- NAL units are classified into VCL and non-VCL NAL units.
E. Parameter Sets
- There are two types of parameter sets: sequence parameter sets, which apply to a series of consecutive coded video pictures called a coded video sequence, and picture parameter sets, which apply to the decoding of one or more individual pictures within a coded video sequence.
- The sequence and picture parameter-set mechanism decouples the transmission of infrequently changing information from the transmission of coded representations of the values of the samples in the video pictures.
- Each VCL NAL unit contains an identifier that refers to the content of the relevant picture parameter set and each picture parameter set contains an identifier that refers to the content of the relevant sequence parameter set.
- In some applications, parameter sets may be sent within the channel that carries the VCL NAL units (termed “in-band” transmission).
- In other applications (see Fig. 3), it can be advantageous to convey the parameter sets “out-of-band” using a more reliable transport mechanism than the video channel itself.
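The chain of identifiers described above (VCL NAL unit to picture parameter set to sequence parameter set) can be illustrated with a small lookup structure. This is a sketch under stated assumptions: the class names and the fields stored in each parameter set are hypothetical placeholders, not the actual syntax elements of the standard.

```python
from dataclasses import dataclass


@dataclass
class SequenceParameterSet:
    sps_id: int
    width_in_mbs: int   # hypothetical example field
    height_in_mbs: int  # hypothetical example field


@dataclass
class PictureParameterSet:
    pps_id: int
    sps_id: int          # identifier of the sequence parameter set it refers to
    entropy_coding: str  # hypothetical example field, e.g. "CAVLC" or "CABAC"


class ParameterSetStore:
    """Holds parameter sets, however they arrived (in-band or out-of-band)."""

    def __init__(self):
        self.sps = {}
        self.pps = {}

    def add_sps(self, sps: SequenceParameterSet) -> None:
        self.sps[sps.sps_id] = sps

    def add_pps(self, pps: PictureParameterSet) -> None:
        self.pps[pps.pps_id] = pps

    def resolve(self, pps_id: int):
        """Follow the identifier carried by a VCL NAL unit to both sets."""
        pps = self.pps[pps_id]
        sps = self.sps[pps.sps_id]
        return sps, pps


if __name__ == "__main__":
    store = ParameterSetStore()
    store.add_sps(SequenceParameterSet(sps_id=0, width_in_mbs=11, height_in_mbs=9))
    store.add_pps(PictureParameterSet(pps_id=3, sps_id=0, entropy_coding="CAVLC"))
    print(store.resolve(3))
```

Because slices carry only a small identifier, the rarely changing data can be delivered once over a reliable channel and reused for many pictures.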
F. Access Units
- The decoding of each access unit results in one decoded picture.
- Each access unit contains a set of VCL NAL units that together compose a primary coded picture.
- It may also be prefixed with an access unit delimiter to aid in locating the start of the access unit.
- Following the primary coded picture may be some additional VCL NAL units that contain redundant representations of areas of the same video picture.
- These are referred to as redundant coded pictures, and are available for use by a decoder in recovering from loss or corruption of the data in the primary coded pictures.
G. Coded Video Sequences
- A coded video sequence consists of a series of access units that are sequential in the NAL unit stream and use only one sequence parameter set.
- Each coded video sequence can be decoded independently of any other coded video sequence, given the necessary parameter set information, which may be conveyed “in-band” or “out-of-band”.
- At the beginning of a coded video sequence is an instantaneous decoding refresh (IDR) access unit.
- As in all prior ITU-T and ISO/IEC JTC1 video standards since H.261 [3], the VCL design follows the so-called block-based hybrid video coding approach (as depicted in Fig. 8), in which each coded picture is represented in block-shaped units of associated luma and chroma samples called macroblocks.
- The basic source-coding algorithm is a hybrid of inter-picture prediction to exploit temporal statistical dependencies and transform coding of the prediction residual to exploit spatial statistical dependencies.
A. Pictures, Frames, and Fields
- A coded picture in [1] can represent either an entire frame or a single field, as was also the case for MPEG-2 video.
- The top field contains the even-numbered rows 0, 2, …, H–2, with H being the number of rows of the frame.
- The bottom field contains the odd-numbered rows (starting with the second line of the frame).
- The coding representation in H.264/AVC is primarily agnostic with respect to this video characteristic, i.e., the underlying interlaced or progressive timing of the original captured pictures.
- Instead, its coding specifies a representation based primarily on geometric concepts rather than being based on timing.
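The geometric relationship between a frame and its two fields can be shown with plain list slicing. The sketch below assumes a frame stored as a list of rows with an even row count; the function names are illustrative only.

```python
def split_into_fields(frame_rows):
    """Top field = even-numbered rows, bottom field = odd-numbered rows."""
    top_field = frame_rows[0::2]
    bottom_field = frame_rows[1::2]
    return top_field, bottom_field


def interleave_fields(top_field, bottom_field):
    """Rebuild the frame by alternating rows of the two fields."""
    frame_rows = []
    for top_row, bottom_row in zip(top_field, bottom_field):
        frame_rows.append(top_row)
        frame_rows.append(bottom_row)
    return frame_rows


if __name__ == "__main__":
    frame = [[row] * 4 for row in range(6)]  # 6 rows, each filled with its row index
    top, bottom = split_into_fields(frame)
    print([r[0] for r in top], [r[0] for r in bottom])  # [0, 2, 4] [1, 3, 5]
```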
B. YCbCr Color Space and 4:2:0 Sampling
- The human visual system seems to perceive scene content in terms of brightness and color information separately, and with greater sensitivity to the details of brightness than color.
- Video transmission systems can be designed to take advantage of this.
- The video color space used by H.264/AVC separates a color representation into three components called Y, Cb, and Cr. Component Y is called luma, and represents brightness.
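- H.264/AVC uses a sampling structure in which each chroma component has one fourth the number of samples of the luma component (half the number of samples in both the horizontal and vertical dimensions). This is called 4:2:0 sampling, with 8 bits of precision per sample.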
- (Proposals for extension of the standard to also support higher-resolution chroma and a larger number of bits per sample are currently being considered.)
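A short calculation shows what 4:2:0 sampling means for the amount of raw data per picture. The sketch assumes the usual 4:2:0 layout, in which Cb and Cr each have half the luma resolution horizontally and vertically, and even picture dimensions; the function name is hypothetical.

```python
def frame_size_420(width: int, height: int, bits_per_sample: int = 8):
    """Return (luma samples, samples per chroma component, total bytes)."""
    luma_samples = width * height
    chroma_samples = (width // 2) * (height // 2)  # for each of Cb and Cr
    total_samples = luma_samples + 2 * chroma_samples
    total_bytes = total_samples * bits_per_sample // 8
    return luma_samples, chroma_samples, total_bytes


if __name__ == "__main__":
    # QCIF picture, 176 x 144 luma samples
    print(frame_size_420(176, 144))  # (25344, 6336, 38016)
```

The two chroma components together add only half as many samples as luma, so a 4:2:0 picture needs 1.5 samples per pixel instead of 3.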
C. Division of the Picture Into Macroblocks
- A picture is partitioned into fixed-size macroblocks that each cover a rectangular picture area of 16×16 samples of the luma component and 8×8 samples of each of the two chroma components.
- This partitioning into macroblocks has been adopted into all previous ITU-T and ISO/IEC JTC1 video coding standards since H.261 [3].
- Macroblocks are the basic building blocks of the standard for which the decoding process is specified.
- The basic coding algorithm for a macroblock is described after the authors explain how macroblocks are grouped into slices.
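To illustrate the partitioning, the sketch below counts the 16×16 macroblocks needed to cover a picture and maps a raster-scan macroblock address to the position of its top-left luma sample. The function names are hypothetical, and rounding the picture size up to a whole number of macroblocks is an assumption about how partial blocks at the picture edge are covered.

```python
MB_SIZE = 16  # luma samples per macroblock side


def macroblock_grid(width: int, height: int):
    """Number of macroblocks per row and per column, rounding up."""
    mbs_wide = (width + MB_SIZE - 1) // MB_SIZE
    mbs_high = (height + MB_SIZE - 1) // MB_SIZE
    return mbs_wide, mbs_high


def mb_address_to_position(mb_address: int, mbs_wide: int):
    """Top-left luma sample (x, y) of a macroblock in raster-scan order."""
    x = (mb_address % mbs_wide) * MB_SIZE
    y = (mb_address // mbs_wide) * MB_SIZE
    return x, y


if __name__ == "__main__":
    wide, high = macroblock_grid(176, 144)
    print(wide, high, wide * high)           # 11 9 99
    print(mb_address_to_position(12, wide))  # (16, 16)
```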
D. Slices and Slice Groups
- A slice is a sequence of macroblocks that are processed in raster-scan order when flexible macroblock ordering (FMO), described next, is not in use.
- The macroblock to slice group map consists of a slice group identification number for each macroblock in the picture, specifying which slice group the associated macroblock belongs to.
- Using FMO, a picture can be split into many macroblock scanning patterns such as interleaved slices, a dispersed macroblock allocation, one or more “foreground” slice groups and a “leftover” slice group, or a checker-board type of mapping.
- The latter two are illustrated in Fig. 7.
- In addition to the coding types available in a P slice, some macroblocks of the B slice can also be coded using inter prediction with two motion-compensated prediction signals per prediction block.
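The macroblock to slice group map mentioned above is simply one slice group number per macroblock. The following sketch builds the checker-board type of mapping for two slice groups and lists which macroblock addresses land in each group; the function names are hypothetical and the map is held as a plain nested list for illustration.

```python
def checkerboard_map(mbs_wide: int, mbs_high: int):
    """Slice group id (0 or 1) for every macroblock, in a checker-board pattern."""
    return [[(x + y) % 2 for x in range(mbs_wide)] for y in range(mbs_high)]


def macroblocks_in_group(mb_map, group: int):
    """Raster-scan addresses of the macroblocks that belong to one slice group."""
    mbs_wide = len(mb_map[0])
    return [
        y * mbs_wide + x
        for y, row in enumerate(mb_map)
        for x, group_id in enumerate(row)
        if group_id == group
    ]


if __name__ == "__main__":
    mb_map = checkerboard_map(4, 2)
    print(mb_map)                           # [[0, 1, 0, 1], [1, 0, 1, 0]]
    print(macroblocks_in_group(mb_map, 0))  # [0, 2, 5, 7]
```

With this pattern every macroblock in one group is surrounded horizontally and vertically by macroblocks of the other group, so losing one group still leaves neighbors available for concealment.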
E. Encoding and Decoding Process for Macroblocks
- All luma and chroma samples of a macroblock are either spatially or temporally predicted, and the resulting prediction residual is encoded using transform coding.
- For transform coding purposes, each color component of the prediction residual signal is subdivided into smaller 4×4 blocks.
- Each block is transformed using an integer transform, and the transform coefficients are quantized and encoded using entropy coding methods.
- The input video signal is split into macroblocks, the association of macroblocks to slice groups and slices is selected, and then each macroblock of each slice is processed as shown.
- An efficient parallel processing of macroblocks is possible when there are various slices in the picture.
F. Adaptive Frame/Field Coding Operation
- In interlaced frames with regions of moving objects or camera motion, two adjacent rows tend to show a reduced degree of statistical dependency when compared to progressive frames.
- Therefore, the frame/field encoding decision can also be made independently for each vertical pair of macroblocks (a 16×32 luma region) in a frame.
- Prediction mode 0 (vertical prediction), mode 1 (horizontal prediction), and mode 2 (DC prediction) are specified similar to the modes in Intra_4×4 prediction except that instead of 4 neighbors on each side to predict a 4×4 block, 16 neighbors on each side to predict a 16×16 block are used.
- That is, more than one prior coded picture can be used as reference for motion-compensated prediction.
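The three 16×16 intra prediction modes named above can be sketched as follows. The code assumes that both the 16 samples above the block and the 16 samples to its left are available, and it uses the rounding (sum + 16) >> 5 for the DC mean, which is taken from the H.264/AVC specification rather than from the text above. The function name is hypothetical, and the fourth (plane) mode of the standard is left out.

```python
def intra16x16_predict(mode: int, top, left):
    """Predict a 16x16 luma block from its 16 top and 16 left neighbors."""
    n = 16
    if mode == 0:  # vertical: copy the row above into every row
        return [list(top) for _ in range(n)]
    if mode == 1:  # horizontal: copy the left neighbor across each row
        return [[left[y]] * n for y in range(n)]
    if mode == 2:  # DC: rounded mean of all 32 neighbors
        dc = (sum(top) + sum(left) + n) >> 5
        return [[dc] * n for _ in range(n)]
    raise ValueError("only modes 0, 1 and 2 are sketched here")


if __name__ == "__main__":
    top = [100] * 16
    left = [60] * 16
    print(intra16x16_predict(2, top, left)[0][0])  # 80
```

The encoder would subtract the chosen prediction from the original block and code only the residual.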
I. Transform, Scaling, and Quantization
- Similar to previous video coding standards, H.264/AVC utilizes transform coding of the prediction residual.
- The smaller transform requires fewer computations and a smaller processing wordlength.
- Since the transformation process involves only adds and shifts, it is also specified such that mismatch between encoder and decoder is avoided (this has been a problem with earlier 8×8 DCT standards).
- A quantization parameter is used for determining the quantization of transform coefficients.
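To show why the 4×4 transform needs only additions and shifts, the sketch below implements the forward core transform as a butterfly and checks it against the equivalent matrix form Y = C X Cᵀ. The matrix C is the forward core transform of the H.264/AVC specification; the normalization that would make it orthonormal is folded into quantization in the standard and is not shown here. Function names are hypothetical.

```python
CF = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def transform_1d(x0, x1, x2, x3):
    """One 4-point transform stage using only adds, subtracts and shifts."""
    s0, s1 = x0 + x3, x1 + x2
    d0, d1 = x0 - x3, x1 - x2
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def forward_transform_4x4(block):
    """Apply the 1-D stage to every row, then to every column."""
    rows_done = [transform_1d(*row) for row in block]
    cols_done = [transform_1d(*row) for row in transpose(rows_done)]
    return transpose(cols_done)


def forward_transform_matrix(block):
    """Reference form: plain integer matrix products CF * X * CF^T."""
    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
                for i in range(4)]
    return matmul(matmul(CF, block), transpose(CF))


if __name__ == "__main__":
    residual = [[5, 11, 8, 10], [9, 8, 4, 12], [1, 10, 11, 4], [19, 6, 15, 7]]
    assert forward_transform_4x4(residual) == forward_transform_matrix(residual)
    print(forward_transform_4x4([[1] * 4 for _ in range(4)])[0])  # [16, 0, 0, 0]
```

Because every step is exact integer arithmetic, an encoder and a decoder built on different hardware produce identical results, which is the mismatch-free property described above.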
L. Hypothetical Reference Decoder
- One of the key benefits provided by a standard is the assurance that all the decoders compliant with the standard will be able to decode a compliant compressed video.
- [Figure caption] Performance of the deblocking filter for highly compressed pictures: (a) without deblocking filter and (b) with deblocking filter.
- Specifying input and output buffer models and developing an implementation independent model of a receiver achieves this.
- In H.264/AVC, the HRD specifies operation of two buffers: 1) the coded picture buffer (CPB) and 2) the decoded picture buffer (DPB).
- The CPB models the arrival and removal time of the coded bits.
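The idea of a buffer that models arrival and removal of coded bits can be illustrated with a heavily simplified leaky-bucket simulation. This is not the HRD of the standard: it assumes a constant arrival rate, one picture removed instantaneously per picture period, and integer bit counts, and every name and parameter below is hypothetical.

```python
def simulate_cpb(picture_sizes, bits_per_interval, cpb_size, initial_fullness):
    """Return the buffer fullness right after each picture is removed.

    Simplified model: bits_per_interval bits arrive during every picture
    period, and each coded picture is removed in one step at its decode time.
    """
    fullness = initial_fullness
    trace = []
    for index, size in enumerate(picture_sizes):
        if size > fullness:
            raise ValueError(f"underflow: picture {index} has not fully arrived")
        fullness -= size  # removal of the coded picture
        trace.append(fullness)
        fullness += bits_per_interval  # arrival during the next period
        if fullness > cpb_size:
            raise ValueError(f"overflow: more than {cpb_size} bits after picture {index}")
    return trace


if __name__ == "__main__":
    print(simulate_cpb([1500, 500, 1200], bits_per_interval=1000,
                       cpb_size=3000, initial_fullness=2000))  # [500, 1000, 800]
```

A stream that never triggers either error in such a model can be decoded by any receiver that provides at least the modeled buffer, which is the kind of guarantee the HRD is meant to give.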
A. Profiles and Levels
- These conformance points are designed to facilitate interoperability between various applications of the standard that have similar functional requirements.
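- The Baseline profile supports all features in H.264/AVC except the following two feature sets: Set 1, comprising B slices, weighted prediction, CABAC, field coding, and picture or macroblock adaptive switching between frame and field coding; and Set 2, comprising SP/SI slices and slice data partitioning.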
- The first set of additional features is supported by the Main profile.
B. Areas for the Profiles of the New Standard to be Used
- The increased compression efficiency of H.264/AVC offers the potential to enhance existing applications or enable new ones.
- In TML-7/8, 1/8-sample accurate motion compensation was introduced, which was then dropped for complexity reasons in JM-5.
Frequently Asked Questions (11)
Q2. What is the way to improve the robustness to data losses?
When used effectively, flexible macroblock ordering can significantly enhance robustness to data losses by managing the spatial relationship between the regions that are coded in each slice.
Q3. What are some other services that can be served by the 3GPP profile?
Other services operate at lower bit rates and are distributed via file transfer, and therefore do not impose delay constraints at all. They can potentially be served by any of the three profiles, depending on various other system requirements. These services are 3GPP multimedia messaging services and video mail.
Q4. What is the method for determining the quantized transform coefficients?
For transmitting the quantized transform coefficients, a more efficient method called Context-Adaptive Variable Length Coding (CAVLC) is employed.
Q5. What is the mechanism of sequence and picture parameter sets?
The sequence and picture parameter-set mechanism decouples the transmission of infrequently changing information from the transmission of coded representations of the values of the samples in the video pictures.
Q6. How are the predictions obtained at quarter sample positions?
The samples at quarter sample positions labeled as a, c, d, n, f, i, k, and q are derived by averaging, with upward rounding, the two nearest samples at integer and half sample positions. The samples at quarter sample positions labeled as e, g, p, and r are derived by averaging, with upward rounding, the two nearest samples at half sample positions in the diagonal direction. The prediction values for the chroma component are always obtained by bilinear interpolation.
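"Averaging with upward rounding" of two integer samples can be written as a single add-and-shift. The helper below is a minimal sketch with a hypothetical name; it takes the two neighboring sample values and makes no assumption about which labeled positions they come from.

```python
def average_round_up(sample_a: int, sample_b: int) -> int:
    """Mean of two samples, with halves rounded upward."""
    return (sample_a + sample_b + 1) >> 1


if __name__ == "__main__":
    print(average_round_up(10, 13))  # 12 (11.5 rounded up)
    print(average_round_up(10, 12))  # 11 (exact mean)
```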
Q7. What was the restriction on the ordering of pictures in prior standards?
In prior standards, pictures encoded using some encoding methods (namely bi-predictively-encoded pictures) could not be used as references for prediction of other pictures in the video sequence.
Q8. How are the predictions obtained at half-sample positions?
The prediction values at half-sample positions are obtained by applying a one-dimensional 6-tap FIR filter horizontally and vertically.
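A one-dimensional pass of the 6-tap filter can be sketched as below. The tap values (1, −5, 20, 20, −5, 1), the rounding offset of 16, the division by 32 and the clipping to the 8-bit range are taken from the H.264/AVC specification, not from the answer above. The function name is hypothetical, and the sketch covers only a half-sample position lying between two integer samples in one row or column.

```python
TAPS = (1, -5, 20, 20, -5, 1)


def half_sample(neighbors) -> int:
    """Half-sample value from the six nearest integer samples along one line."""
    if len(neighbors) != len(TAPS):
        raise ValueError("exactly six integer samples are required")
    acc = sum(tap * s for tap, s in zip(TAPS, neighbors))
    value = (acc + 16) >> 5          # round and divide by 32
    return max(0, min(255, value))   # clip to the 8-bit sample range


if __name__ == "__main__":
    print(half_sample([100] * 6))                 # 100: flat areas are preserved
    print(half_sample([0, 0, 0, 255, 255, 255]))  # 128: midpoint of a step edge
```

The negative taps sharpen the interpolation compared with plain averaging, which is why the result has to be clipped back into the valid sample range.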
Q9. How does the encoder use the luma transform?
The H.264/AVC standard enables this in two ways: 1) by using a hierarchical transform to extend the effective block size used for low-frequency chroma information to an 8×8 array and 2) by allowing the encoder to select a special coding type for intra coding, enabling extension of the length of the luma transform for low-frequency information to a 16×16 block size in a manner very similar to that applied to the chroma.
Q10. What is the effective use of arithmetic coding?
While arithmetic coding was previously found as an optional feature of H.263, a more effective use of this technique is found in H.264/AVC to create a very powerful entropy coding method known as CABAC (context-adaptive binary arithmetic coding).
Q11. What is the way to convey the parameter sets out of band?
In other applications (see Fig. 3), it can be advantageous to convey the parameter sets “out-of-band” using a more reliable transport mechanism than the video channel itself.