What are the levels of MVC encoders?

Levels impose constraints on the bitstreams produced by MVC encoders, to establish bounds on the necessary decoder resources and complexity.

What is the purpose of the MVC design?

For applications in which random access or view switching is important, the prediction structure can be designed to minimize access delay, and the MVC design provides a way for an encoder to describe the prediction structure for this purpose.

What was the first call for proposals for efficient multiview video coding?

Considering recent advancements in video compression technology and the anticipated needs for state-of-the-art coding of multiview video, MPEG issued a Call for Proposals (CfP) for efficient multiview video coding technology in October of 2005.

What is the main consequence of not requiring changes to lower levels of the syntax?

A major consequence of not requiring changes to lower levels of the syntax (at the macroblock level and below it) is that MVC is compatible with existing hardware for decoding single-view video with H.264/MPEG-4 AVC.

What is the average reduction in bit rate for the dependent view of a stereo movie?

In other studies [50], an average reduction of 20-30% of the bit rate for the second (dependent) view of typical stereo movie content was reported, with a peak reduction for an individual test sequence of 43% of the bit rate of the dependent view.

How can asymmetrical coding further reduce the bit rate of stereo video?

Prior studies on asymmetrical coding of stereo video, in which one of the views is encoded with lower quality than the other, suggest that a further substantial savings in bit rate for the non-base view could be achieved using that technique.

What are the main aspects of the MVC design?

Several other aspects of the MVC design were further elaborated on in [44], including random access and view switching, extraction of operation points (sets of coded views at particular levels of a nested temporal referencing structure) of an MVC bitstream for adaptation to network and device constraints, parallel processing, and a description of several newly adopted SEI messages that are relevant for multiview video bitstreams.

(Open Access) Overview of the Stereo and Multiview Video Coding Extensions of the H.264/MPEG-4 AVC Standard (2011) | Anthony Vetro

MITSUBISHI ELECTRIC RESEARCH LABORATORIES

http://www.merl.com

Overview of the Stereo and Multiview Video Coding Extensions of the H.264/MPEG-4 AVC Standard

Vetro, A.; Wiegand, T.; Sullivan, G.J.

TR2011-022 January 2011


Proceedings of the IEEE

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, Inc., 2011

201 Broadway, Cambridge, Massachusetts 02139


PROCEEDINGS OF THE IEEE (2011): VETRO, WIEGAND, SULLIVAN

Abstract: Significant improvements in video compression capability have been demonstrated with the introduction of the H.264/MPEG-4 Advanced Video Coding (AVC) standard. Since developing this standard, the Joint Video Team of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) has also standardized an extension of that technology that is referred to as multiview video coding (MVC). MVC provides a compact representation for multiple views of a video scene, such as multiple synchronized video cameras. Stereo-paired video for 3D viewing is an important special case of MVC. The standard enables inter-view prediction to improve compression capability, as well as supporting ordinary temporal and spatial prediction. It also supports backward compatibility with existing legacy systems by structuring the MVC bitstream to include a compatible "base view". Each other view is encoded at the same picture resolution as the base view. In recognition of its high quality encoding capability and support for backward compatibility, the Stereo High profile of the MVC extension was selected by the Blu-ray Disc Association as the coding format for 3D video with high-definition resolution. This paper provides an overview of the algorithmic design used for extending H.264/MPEG-4 AVC towards MVC. The basic approach of MVC for enabling inter-view prediction and view scalability in the context of H.264/MPEG-4 AVC is reviewed. Related supplemental enhancement information (SEI) metadata is also described. Various "frame compatible" approaches for support of stereo-view video as an alternative to MVC are also discussed. A summary of the coding performance achieved by MVC for both stereo and multiview video is also provided. Future directions and challenges related to 3D video are also briefly discussed.

Index Terms: MVC, H.264, MPEG-4, AVC, standards, stereo video, multiview video coding, inter-view prediction, 3D video, Blu-ray Disc

I. INTRODUCTION

3D video is currently being introduced to the home through various channels, including Blu-ray Disc, cable and satellite transmission, terrestrial broadcast, and streaming and download through the Internet. Today's 3D video offers a high-quality and immersive multimedia experience, which has only recently become feasible on consumer electronics platforms through advances in display technology, signal processing, transmission technology, and circuit design.

Manuscript received April 1, 2010. Revised version submitted MM DD, 2010.

A. Vetro is with Mitsubishi Electric Research Labs, Cambridge, MA, 02139 USA (email: avetro@merl.com).

T. Wiegand is jointly affiliated with the Berlin Institute of Technology and the Fraunhofer Institute for Telecommunications – Heinrich Hertz Institute (HHI), Einsteinufer 37, 10587 Berlin, Germany (email: wiegand@hhi.de).

G. J. Sullivan is with Microsoft Corporation, Redmond, WA, 98052 USA (email: garys@ieee.org).

In addition to advances on the display and receiver side, there has also been a notable increase in the production of 3D content. The number of 3D feature film releases has been growing dramatically each year, and several major studios have announced that all of their future releases will be in 3D. There are major investments being made to upgrade digital cinema theaters with 3D capabilities, several major feature film releases have attracted a majority of their theater revenue in 3D showings (including Avatar, the current top-grossing feature film of all time), and premium pricing for 3D has become a significant factor in the cinema revenue model. The push from both the production and display sides has played a significant role in fuelling a consumer appetite for 3D video.

There are a number of challenges to overcome in making 3D video for consumer use in the home become fully practical and show sustained market value for the long term. For one, the usability and consumer acceptance of 3D viewing technology will be critical. In particular, mass consumer acceptance of the special eyewear needed to view 3D in the home with current display technology is still relatively unknown. In general, content creators, service providers and display manufacturers need to ensure that the consumer has a high quality experience and is not burdened with high transition costs or turned off by viewing discomfort or fatigue. The availability of premium 3D content in the home is another major factor to be considered. These are broader issues that will significantly influence the rate of 3D adoption and market size, but are beyond the scope of this paper.

With regard to the delivery of 3D video, it is essential to determine an appropriate data format, taking into consideration the constraints imposed by each delivery channel – including bit rate and compatibility requirements. Needless to say, interoperability through the delivery chain and among various devices will be essential. The 3D representation, compression formats, and signaling protocols will largely define the interoperability of the system.

For purposes of this paper, 3D video is considered to refer to either a general n-view multiview video representation or its important stereo-view special case. Efficient compression of such data is the primary subject of this paper. The paper also discusses stereo representation formats that could be coded using existing 2D video coding methods – such approaches often being referred to as frame-compatible encoding schemes.

Footnote (on Avatar as the top-grossing feature film of all time): Based on total revenue without inflation adjustments.

Multiview video coding (MVC) is the process by which stereo and multiview video signals are efficiently coded. The basic approach of most MVC schemes is to exploit not only the redundancies that exist temporally between the frames within a given view, but also the similarities between frames of neighboring views. By doing so, a reduction in bit rate relative to independent coding of the views can be achieved without sacrificing the reconstructed video quality. In this paper, the term MVC is used interchangeably for either the general concept of coding multiview video or for the particular design that has been standardized as a recent extension of the H.264/MPEG-4 AVC standard [1].

The topic of multiview video coding has been an active research area for more than 20 years, with early work on disparity-compensated prediction by Lukacs first appearing in 1986 [2], followed by other coding schemes in the late 1980's and early 1990's [3][4]. In 1996, the international video coding standard H.262/MPEG-2 Video [5] was amended to support the coding of multiview video by means of design features originally intended for temporal scalability [6][7]. However, the multiview extension of H.262/MPEG-2 Video was never deployed in actual products. It was not the right time to introduce 3D video into the market since the more fundamental transition from standard-definition analog to high-definition digital video services was a large challenge in itself. Adequate display technology and hardware processing capabilities were also lacking at the time. In addition to this, the H.262/MPEG-2 Video solution did not offer a very compelling compression improvement due to limitations in the coding tools enabled for inter-view prediction in that design [8]-[10].

This paper focuses on the MVC extension of the H.264/MPEG-4 AVC standard. Relevant supplemental enhancement information (SEI) metadata and alternative approaches to enabling multiview services are also discussed. The paper is organized as follows. Section II explains the various multiview video applications of MVC as well as their implications in terms of requirements. Section III gives the history of MVC, including prior standardization action. Section IV briefly reviews basic design concepts of H.264/MPEG-4 AVC. The MVC design is summarized in Section V, including profile definitions and a summary of coding performance. Alternative stereo representation formats and their signaling in the H.264/MPEG-4 AVC standard are described in Section VI. Concluding remarks are given in Section VII. For more detailed information about MVC and stereo support in the H.264/MPEG-4 AVC standard, the reader is referred to the most recent edition of the standard itself [1], the amendment completed in July 2008 that added the MVC extension to it [11], and the additional amendment completed one year later that added the Stereo High profile and frame packing arrangement SEI message [12].

II. MULTIVIEW SCENARIOS, APPLICATIONS, AND REQUIREMENTS

The prediction structures and coding schemes presented in this paper have been developed and investigated in the context of the MPEG, and later JVT, standardization project for MVC. Therefore, most of the scenarios for multiview coding, applications and their requirements are specified by the MVC project [13] as presented in the next sections.

A. Multiview Scenarios and Applications

The primary usage scenario for multiview video is to support 3D video applications, where 3D depth perception of a visual scene is provided by a 3D display system. There are many types of 3D display systems [14], ranging from classic stereo systems that require special-purpose glasses to more sophisticated multiview autostereoscopic displays that do not require glasses [15]. The stereo systems only require two views, where a left-eye view is presented to the viewer's left eye, and a right-eye view is presented to the viewer's right eye. The 3D display technology and glasses ensure that the appropriate signals are viewed by the correct eye. This is accomplished with either passive polarization or active shutter techniques. The multiview displays have much greater data throughput requirements relative to conventional stereo displays in order to support a given picture resolution, since 3D is achieved by essentially emitting multiple complete video sample arrays in order to form view-dependent pictures. Such displays can be implemented, for example, using conventional high-resolution displays and parallax barriers; other technologies include lenticular overlay sheets and holographic screens. Each view-dependent video sample can be thought of as emitting a small number of light rays in a set of discrete viewing directions – typically between eight and a few dozen for an autostereoscopic display. Often these directions are distributed in a horizontal plane, such that parallax effects are limited to the horizontal motion of the observer. A more comprehensive review of 3D display technologies is covered by other articles in this special issue.

Another goal of multiview video is to enable free-viewpoint video [16][17]. In this scenario, the viewpoint and view direction can be interactively changed. Each output view can either be one of the input views or a virtual view that was generated from a smaller set of multiview inputs and other data that assists in the view generation process. With such a system, viewers can freely navigate through the different viewpoints of the scene – within a range covered by the acquisition cameras. Such an application of multiview video could be implemented with conventional 2D displays. However, more advanced versions of the free-viewpoint system that work with 3D displays could also be considered. We have already seen the use of this functionality in broadcast production environments, e.g., to change the viewpoint of a sports scene to show a better angle of a play. Such functionality may also be of interest in surveillance, education, gaming, and sightseeing applications. Finally, we may also imagine providing this interactive capability directly to the home viewer, e.g., for special events such as concerts.

Another important application of multiview video is to support immersive teleconference applications. Beyond the advantages provided by 3D displays, it has been reported that a teleconference system could enable a more realistic communication experience when motion parallax is supported. Motion parallax is caused by the change in the appearance of a scene when the viewer shifts their viewing position, e.g., shifting the viewing position to reveal occluded scene content. In an interactive system design, it can be possible for the transmission system to adaptively shift its encoded viewing position to achieve a dynamic perspective change [18][19][20]. Perspective changes can be controlled explicitly by user intervention through a user interface control component or by a system that senses the observer's viewing position and adjusts the displayed scene accordingly.

Other interesting applications of multiview video have been demonstrated by Wilburn, et al. [21]. In this work, a high spatial sampling of a scene through a large multiview video camera array was used for advanced imaging. Among the capabilities shown was an effective increase of bit depth and frame rate, as well as synthetic aperture photography effects. Since then, there have also been other exciting developments in the area of computational imaging that rely on the acquisition of multiview video [22].

For all of the above applications and scenarios, the storage and transmission capacity requirements of the system are significantly increased. Consequently, there is a strong need for efficient multiview video compression techniques. Specific requirements are discussed in the next subsection.

B. Standardization Requirements

The central requirement for most video coding designs is high compression efficiency. In the specific case of MVC this means a significant gain compared to independent compression of each view. Compression efficiency measures the trade-off between cost (in terms of bit rate) and benefit (in terms of video quality) – i.e., the quality at a certain bit rate or the bit rate at a certain quality. However, compression efficiency is not the only factor under consideration for a video coding standard. Some requirements may even be somewhat conflicting, such as desiring both good compression efficiency and low delay. In such cases, a good trade-off needs to be found. General requirements for video coding capabilities, such as minimum resource consumption (memory, processing power), low delay, error robustness, and support of a range of picture resolutions, color sampling structures, and bit depth precisions, tend to be applicable to nearly any video coding design.
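As a rough illustration of the efficiency requirement, the gain of joint multiview coding over independent coding of each view can be expressed as a relative bit-rate saving at matched video quality. The sketch below is only illustrative: the function name `bitrate_saving_percent` and the example rates are hypothetical and are not taken from the paper's results.

```python
def bitrate_saving_percent(independent_kbps: float, joint_kbps: float) -> float:
    """Relative bit-rate saving (in percent) of joint coding versus
    independent coding of the same views at the same video quality."""
    if independent_kbps <= 0:
        raise ValueError("independent_kbps must be positive")
    return 100.0 * (independent_kbps - joint_kbps) / independent_kbps


# Hypothetical example: two views coded independently need 8000 kbit/s in
# total, versus 6400 kbit/s when inter-view prediction is used at equal quality.
saving = bitrate_saving_percent(8000.0, 6400.0)
print(f"{saving:.1f}% saving")  # prints "20.0% saving"
```

A negative result would indicate that the joint coding used more bits than independent coding at that quality point.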

Some requirements are specific to MVC – as highlighted in the following. Temporal random access is a requirement for virtually any video coding design. For MVC, view-switching random access also becomes important. Both together ensure that any image can be accessed, decoded, and displayed by starting the decoder at a random access point and decoding a relatively small quantity of data on which that image may depend. Random access can be provided by insertion of pictures that are intra-picture coded (i.e., pictures that are coded without any use of prediction from other pictures). Scalability is also a desirable feature for video coding designs. Here, we refer to the ability of a decoder to access only a portion of a bitstream while still being able to generate effective video output – although reduced in quality to a degree commensurate with the quantity of data in the subset used for the decoding process. This reduction in quality may involve reduced temporal or spatial resolution, or a reduced quality of representation at the same temporal and spatial resolution. For MVC, additionally, view scalability is desirable. In this case, a portion of the bitstream can be accessed in order to output a subset of the encoded views. Also, backward compatibility was required for the MVC standard. This means that a subset of the MVC bitstream corresponding to one "base view" needs to be decodable by an ordinary (non-MVC) H.264/MPEG-4 AVC decoder, and the other data representing other views should be encoded in a way that will not affect that base view decoding capability. Achieving a desired degree of quality consistency among views is also addressed – i.e., it should be possible to control the encoding quality of the various views – for instance to provide approximately constant quality over all views or to select a preferential quality for encoding some views versus others. The ability of an encoder or decoder to use parallel processing was required to enable practical implementation and to manage processing resources effectively. It should also be possible to convey camera parameters (extrinsic and intrinsic) along with the bitstream in order to support intermediate view interpolation at the decoder and to enable other decoding-side enhanced capabilities such as multi-view feature detection and classification, e.g., determining the pose of a face within a scene, which would typically require solving a correspondence problem based on the scene geometry.

Moreover, for ease of implementation, it was highly desirable for the MVC design to have as many design elements in common with an ordinary H.264/MPEG-4 AVC system as possible. Such a commonality of design components can enable an MVC system to be constructed rapidly from elements of existing H.264/MPEG-4 AVC products and to be tested more easily.
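The view scalability and backward compatibility requirements described in this subsection can be pictured as a dependency question: to output a chosen subset of views, a decoder needs those views plus every view they are predicted from, and the base view needs nothing else. The following is a minimal sketch under stated assumptions. The function `views_needed` and the dependency map `deps` are hypothetical illustrations; they do not reproduce how the MVC syntax actually signals inter-view dependencies.

```python
from typing import Dict, Iterable, List, Set


def views_needed(targets: Iterable[int], deps: Dict[int, List[int]]) -> Set[int]:
    """Return the target views plus every view they depend on, directly or
    indirectly, through inter-view prediction."""
    needed: Set[int] = set()
    stack = list(targets)
    while stack:
        view = stack.pop()
        if view in needed:
            continue  # already visited
        needed.add(view)
        stack.extend(deps.get(view, []))
    return needed


# Hypothetical 3-view layout: view 0 is the base view (no dependencies),
# view 2 is predicted from view 0, and view 1 from views 0 and 2.
deps = {0: [], 1: [0, 2], 2: [0]}

print(sorted(views_needed([0], deps)))  # [0]: the base view stands alone
print(sorted(views_needed([2], deps)))  # [0, 2]: view 2 also needs the base view
```

In this toy layout, a legacy single-view decoder would keep only view 0, while a stereo decoder outputting views 0 and 2 could discard the data for view 1.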

III. HISTORY OF MVC

One of the earliest studies on coding of multiview images was done by Lukacs [2]; in this work, the concept of disparity-compensated inter-view prediction was introduced. In later work by Dinstein, et al. [3], the predictive coding approach was compared to 3D block transform coding for stereo image compression. In [4], Perkins presented a transform-domain technique for disparity-compensated prediction, as well as a mixed-resolution coding scheme.

The first support for multiview video coding in an international standard was in a 1996 amendment to the H.262/MPEG-2 video coding standard [6]. It supported the coding of two views only. In that design, the left view was referred to as the "base view" and its encoding was compatible with that for ordinary single-view decoders. The right view
