Overview of the Stereo and Multiview Video Coding Extensions of the H.264/MPEG-4 AVC Standard

Anthony Vetro, Thomas Wiegand, Gary J. Sullivan
Proceedings of the IEEE, Vol. 99, Iss. 4, pp. 626-642
TLDR
This paper provides an overview of the algorithmic design used for extending H.264/MPEG-4 AVC towards MVC and summarizes the coding performance achieved by MVC for both stereo and multiview video.
Abstract
Significant improvements in video compression capability have been demonstrated with the introduction of the H.264/MPEG-4 advanced video coding (AVC) standard. Since developing this standard, the Joint Video Team of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) has also standardized an extension of that technology that is referred to as multiview video coding (MVC). MVC provides a compact representation for multiple views of a video scene, such as multiple synchronized video cameras. Stereo-paired video for 3-D viewing is an important special case of MVC. The standard enables inter-view prediction to improve compression capability, as well as supporting ordinary temporal and spatial prediction. It also supports backward compatibility with existing legacy systems by structuring the MVC bitstream to include a compatible “base view.” Each other view is encoded at the same picture resolution as the base view. In recognition of its high-quality encoding capability and support for backward compatibility, the stereo high profile of the MVC extension was selected by the Blu-Ray Disc Association as the coding format for 3-D video with high-definition resolution. This paper provides an overview of the algorithmic design used for extending H.264/MPEG-4 AVC towards MVC. The basic approach of MVC for enabling inter-view prediction and view scalability in the context of H.264/MPEG-4 AVC is reviewed. Related supplemental enhancement information (SEI) metadata is also described. Various “frame compatible” approaches for support of stereo-view video as an alternative to MVC are also discussed. A summary of the coding performance achieved by MVC for both stereo- and multiview video is also provided. Future directions and challenges related to 3-D video are also briefly discussed.


MITSUBISHI ELECTRIC RESEARCH LABORATORIES
http://www.merl.com
Overview of the Stereo and Multiview Video
Coding Extensions of the H.264/MPEG-4
AVC Standard
Vetro, A.; Wiegand, T.; Sullivan, G.J.
TR2011-022 January 2011
Proceedings of the IEEE
This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part
without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include
the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of
the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or
republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All
rights reserved.
Copyright © Mitsubishi Electric Research Laboratories, Inc., 2011
201 Broadway, Cambridge, Massachusetts 02139


PROCEEDINGS OF THE IEEE (2011): VETRO, WIEGAND, SULLIVAN
Index Terms—MVC, H.264, MPEG-4, AVC, standards, stereo
video, multiview video coding, inter-view prediction, 3D video,
Blu-ray Disc
I. INTRODUCTION
3D VIDEO is currently being introduced to the home
through various channels, including Blu-ray Disc,
cable and satellite transmission, terrestrial broadcast, and
streaming and download through the Internet. Today’s 3D
video offers a high-quality and immersive multimedia experience,
which has only recently become feasible on consumer
electronics platforms through advances in display technology,
signal processing, transmission technology, and circuit design.
Manuscript received April 1, 2010.
Revised version submitted MM DD, 2010.
A. Vetro is with Mitsubishi Electric Research Labs, Cambridge, MA,
02139 USA (email: avetro@merl.com).
T. Wiegand is jointly affiliated with the Berlin Institute of Technology and
the Fraunhofer Institute for Telecommunications Heinrich Hertz Institute
(HHI), Einsteinufer 37, 10587 Berlin, Germany (email: wiegand@hhi.de).
G. J. Sullivan is with Microsoft Corporation, Redmond, WA, 98052 USA
(email: garys@ieee.org).
In addition to advances on the display and receiver side,
there has also been a notable increase in the production of 3D
content. The number of 3D feature film releases has been
growing dramatically each year, and several major studios
have announced that all of their future releases will be in 3D.
There are major investments being made to upgrade digital
cinema theaters with 3D capabilities, several major feature
film releases have attracted a majority of their theater revenue
in 3D showings (including Avatar, the current top grossing
feature film of all time¹), and premium pricing for 3D has
become a significant factor in the cinema revenue model. The
push from both the production and display sides has played a
significant role in fuelling a consumer appetite for 3D video.
There are a number of challenges to overcome in making
3D video for consumer use in the home become fully practical
and show sustained market value for the long term. For one,
the usability and consumer acceptance of 3D viewing technol-
ogy will be critical. In particular, mass consumer acceptance of
the special eyewear needed to view 3D in the home with cur-
rent display technology is still relatively unknown. In general,
content creators, service providers and display manufacturers
need to ensure that the consumer has a high quality experience
and is not burdened with high transition costs or turned off by
viewing discomfort or fatigue. The availability of premium 3D
content in the home is another major factor to be considered.
These are broader issues that will significantly influence the
rate of 3D adoption and market size, but are beyond the scope
of this paper.
With regard to the delivery of 3D video, it is essential to de-
termine an appropriate data format, taking into consideration
the constraints imposed by each delivery channel including
bit rate and compatibility requirements. Needless to say, inter-
operability through the delivery chain and among various de-
vices will be essential. The 3D representation, compression
formats, and signaling protocols will largely define the inter-
operability of the system.
For purposes of this paper, 3D video is considered to refer
to either a general n-view multiview video representation or its
important stereo-view special case. Efficient compression of
such data is the primary subject of this paper. The paper also
discusses stereo representation formats that could be coded
using existing 2D video coding methods, such approaches
often being referred to as frame-compatible encoding schemes.
¹ Based on total revenue without inflation adjustments.
Multiview video coding (MVC) is the process by which ste-
reo and multiview video signals are efficiently coded. The
basic approach of most MVC schemes is to exploit not only
the redundancies that exist temporally between the frames
within a given view, but also the similarities between frames of
neighboring views. By doing so, a reduction in bit rate relative
to independent coding of the views can be achieved without
sacrificing the reconstructed video quality. In this paper, the
term MVC is used interchangeably for either the general con-
cept of coding multiview video or for the particular design that
has been standardized as a recent extension of the
H.264/MPEG-4 AVC standard [1].
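As an illustration of exploiting the similarities between frames of neighboring views, the following minimal sketch estimates a horizontal disparity for one block by exhaustive search. It is not the normative MVC process: the function names, the sum-of-absolute-differences cost, the search range, and the synthetic frames are all assumptions made for this example.

```python
def get_block(frame, top, left, size):
    """Return a size x size block of samples as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def best_horizontal_disparity(target, reference, top, left, size, max_disp):
    """Search the reference view along the same rows for the best match.

    Returns (disparity, cost), where disparity is the horizontal offset
    into the reference view that minimizes the SAD cost.
    """
    current = get_block(target, top, left, size)
    width = len(reference[0])
    best_d, best_cost = 0, None
    for d in range(-max_disp, max_disp + 1):
        x = left + d
        if x < 0 or x + size > width:
            continue  # candidate block would fall outside the picture
        cost = sad(current, get_block(reference, top, x, size))
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost


# Synthetic example (hypothetical data): the second view is the base view
# shifted horizontally by two samples.
base_view = [[(3 * x + 7 * y) % 32 for x in range(16)] for y in range(8)]
other_view = [row[2:] + row[:2] for row in base_view]

disparity, cost = best_horizontal_disparity(other_view, base_view,
                                            top=2, left=4, size=4, max_disp=3)
print(disparity, cost)
```

For this synthetic pair, the block in the second view is found in the base view two samples to the right, and the prediction residual for that block is zero.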
The topic of multiview video coding has been an active re-
search area for more than 20 years, with early work on dispar-
ity-compensated prediction by Lukacs first appearing in 1986
[2], followed by other coding schemes in the late 1980's and
early 1990's [3][4]. In 1996, the international video coding
standard H.262/MPEG-2 Video [5] was amended to support
the coding of multiview video by means of design features
originally intended for temporal scalability [6][7]. However,
the multiview extension of H.262/MPEG-2 Video was never
deployed in actual products. It was not the right time to intro-
duce 3D video into the market since the more fundamental
transition from standard-definition analog to high-definition
digital video services was a large challenge in itself. Adequate
display technology and hardware processing capabilities were
also lacking at the time. In addition to this, the H.262/MPEG-2
Video solution did not offer a very compelling compression
improvement due to limitations in the coding tools enabled for
inter-view prediction in that design [8]-[10].
This paper focuses on the MVC extension of the
H.264/MPEG-4 AVC standard. Relevant supplemental en-
hancement information (SEI) metadata and alternative ap-
proaches to enabling multiview services are also discussed.
The paper is organized as follows. Section II explains the vari-
ous multiview video applications of MVC as well as their im-
plications in terms of requirements. Section III gives the his-
tory of MVC, including prior standardization action. Sec-
tion IV briefly reviews basic design concepts of H.264/MPEG-
4 AVC. The MVC design is summarized in Section V, includ-
ing profile definitions and a summary of coding performance.
Alternative stereo representation formats and their signaling in
the H.264/MPEG-4 AVC standard are described in Section VI.
Concluding remarks are given in Section VII. For more de-
tailed information about MVC and stereo support in the
H.264/MPEG-4 AVC standard, the reader is referred to the
most recent edition of the standard itself [1], the amendment
completed in July 2008 that added the MVC extension to it
[11], and the additional amendment completed one year later
that added the Stereo High profile and frame packing arrange-
ment SEI message [12].
II. MULTIVIEW SCENARIOS, APPLICATIONS, AND REQUIREMENTS
The prediction structures and coding schemes presented in
this paper have been developed and investigated in the context
of the MPEG, and later JVT, standardization project for MVC.
Therefore, most of the scenarios for multiview coding, appli-
cations and their requirements are specified by the MVC pro-
ject [13] as presented in the next sections.
A. Multiview Scenarios and Applications
The primary usage scenario for multiview video is to sup-
port 3D video applications, where 3D depth perception of a
visual scene is provided by a 3D display system. There are
many types of 3D display systems [14], ranging from classic stereo
systems that require special-purpose glasses to more sophisticated
multiview autostereoscopic displays that do not require
glasses [15]. The stereo systems only require two views, where
a left-eye view is presented to the viewer's left eye, and a right-
eye view is presented to the viewer's right eye. The 3D display
technology and glasses ensure that the appropriate signals are
viewed by the correct eye. This is accomplished with either
passive polarization or active shutter techniques. The mul-
tiview displays have much greater data throughput require-
ments relative to conventional stereo displays in order to sup-
port a given picture resolution, since 3D is achieved by essen-
tially emitting multiple complete video sample arrays in order
to form view-dependent pictures. Such displays can be imple-
mented, for example, using conventional high-resolution dis-
plays and parallax barriers; other technologies include lenticu-
lar overlay sheets and holographic screens. Each view-
dependent video sample can be thought of as emitting a small
number of light rays in a set of discrete viewing directions,
typically between eight and a few dozen for an autostereoscopic
display. Often these directions are distributed in a horizontal
plane, such that parallax effects are limited to the horizontal
motion of the observer. A more comprehensive review
of 3D display technologies is covered by other articles in this
special issue.
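To make the throughput comparison between stereo and multiview displays concrete, the short calculation below counts raw luma samples per second for a given number of complete video sample arrays. The picture size, frame rate, and view counts are hypothetical values chosen for illustration and do not come from the paper.

```python
def raw_sample_rate(num_views, width, height, fps):
    """Luma samples per second for num_views complete sample arrays."""
    return num_views * width * height * fps


# Hypothetical example: 1920x1080 pictures at 24 frames per second.
stereo_rate = raw_sample_rate(2, 1920, 1080, 24)
eight_view_rate = raw_sample_rate(8, 1920, 1080, 24)

print(stereo_rate)                    # samples per second for two views
print(eight_view_rate)                # samples per second for eight views
print(eight_view_rate / stereo_rate)  # throughput ratio
```

Under these assumptions the raw throughput grows linearly with the number of views, so an eight-view display needs four times the sample rate of a stereo display at the same picture resolution.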
Another goal of multiview video is to enable free-viewpoint
video [16][17]. In this scenario, the viewpoint and view direc-
tion can be interactively changed. Each output view can either
be one of the input views or a virtual view that was generated
from a smaller set of multiview inputs and other data that as-
sists in the view generation process. With such a system, view-
ers can freely navigate through the different viewpoints of the
scene within a range covered by the acquisition cameras.
Such an application of multiview video could be implemented
with conventional 2D displays. However, more advanced ver-
sions of the free-viewpoint system that work with 3D displays
could also be considered. We have already seen the use of this
functionality in broadcast production environments, e.g., to
change the viewpoint of a sports scene to show a better angle
of a play. Such functionality may also be of interest in surveillance,
education, gaming, and sightseeing applications. Finally,
we may also imagine providing this interactive capability
directly to the home viewer, e.g., for special events such as
concerts.
Another important application of multiview video is to support
immersive teleconference applications. Beyond the advantages
provided by 3D displays, it has been reported that a teleconference
system could enable a more realistic communication
experience when motion parallax is supported. Motion
parallax is caused by the change in the appearance of a scene
when the viewer shifts their viewing position, e.g., shifting the
viewing position to reveal occluded scene content. In an inter-
active system design, it can be possible for the transmission
system to adaptively shift its encoded viewing position to
achieve a dynamic perspective change [18][19][20]. Perspec-
tive changes can be controlled explicitly by user intervention
through a user interface control component or by a system that
senses the observer's viewing position and adjusts the dis-
played scene accordingly.
Other interesting applications of multiview video have been
demonstrated by Wilburn, et al. [21]. In this work, a high spa-
tial sampling of a scene through a large multiview video cam-
era array was used for advanced imaging. Among the capabili-
ties shown was an effective increase of bit depth and frame
rate, as well as synthetic aperture photography effects. Since
then, there have also been other exciting developments in the
area of computational imaging that rely on the acquisition of
multiview video [22].
For all of the above applications and scenarios, the storage
and transmission capacity requirements of the system are sig-
nificantly increased. Consequently, there is a strong need for
efficient multiview video compression techniques. Specific
requirements are discussed in the next subsection.
B. Standardization Requirements
The central requirement for most video coding designs is
high compression efficiency. In the specific case of MVC, this
means a significant gain compared to independent compression
of each view. Compression efficiency measures the trade-off
between cost (in terms of bit rate) and benefit (in terms of
video quality), i.e., the quality at a certain bit rate or the bit
rate at a certain quality. However, compression efficiency is
not the only factor under consideration for a video coding
standard. Some requirements may even be somewhat conflict-
ing, such as desiring both good compression efficiency and
low delay. In such cases, a good trade-off needs to be found.
General requirements for video coding capabilities, such as
minimum resource consumption (memory, processing power),
low delay, error robustness, and support of a range of picture
resolutions, color sampling structures, and bit depth precisions,
tend to be applicable to nearly any video coding design.
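The gain relative to independent compression of each view can be expressed as a bit rate saving at equal quality. The sketch below shows that arithmetic; the function name and the bit rates are hypothetical and are not measurements from the paper.

```python
def bitrate_saving_percent(independent_kbps, joint_kbps):
    """Percentage bit rate saved relative to independent coding of all views."""
    if independent_kbps <= 0:
        raise ValueError("independent_kbps must be positive")
    return 100.0 * (independent_kbps - joint_kbps) / independent_kbps


# Hypothetical stereo example at equal quality: each view needs 8000 kbit/s
# when coded independently; with inter-view prediction the second view
# is assumed to need 6000 kbit/s.
independent_total = 8000 + 8000
joint_total = 8000 + 6000

saving = bitrate_saving_percent(independent_total, joint_total)
print(saving)
```

With these assumed numbers, a 25% reduction for the second view alone corresponds to a 12.5% reduction of the total stereo bit rate, because the base view is coded the same way in both cases.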
Some requirements are specific to MVC as highlighted in
the following. Temporal random access is a requirement for
virtually any video coding design. For MVC, view-switching
random access also becomes important. Both together ensure
that any image can be accessed, decoded, and displayed by
starting the decoder at a random access point and decoding a
relatively small quantity of data on which that image may de-
pend. Random access can be provided by insertion of pictures
that are intra-picture coded (i.e., pictures that are coded with-
out any use of prediction from other pictures). Scalability is
also a desirable feature for video coding designs. Here, we
refer to the ability of a decoder to access only a portion of a
bitstream while still being able to generate effective video output,
although reduced in quality to a degree commensurate
with the quantity of data in the subset used for the decoding
process. This reduction in quality may involve reduced tempo-
ral or spatial resolution, or a reduced quality of representation
at the same temporal and spatial resolution. For MVC, addi-
tionally, view scalability is desirable. In this case, a portion of
the bitstream can be accessed in order to output a subset of the
encoded views. Also, backward compatibility was required for
the MVC standard. This means that a subset of the MVC bitstream
corresponding to one "base view" needs to be decodable
by an ordinary (non-MVC) H.264/MPEG-4 AVC decoder,
and the other data representing other views should be
encoded in a way that will not affect that base view decoding
capability. Achieving a desired degree of quality consistency
among views is also addressed, i.e., it should be possible to
control the encoding quality of the various views, for instance
to provide approximately constant quality over all views or to
select a preferential quality for encoding some views versus
others. The ability of an encoder or decoder to use parallel
processing was required to enable practical implementation
and to manage processing resources effectively. It should also
be possible to convey camera parameters (extrinsic and intrin-
sic) along with the bitstream in order to support intermediate
view interpolation at the decoder and to enable other decod-
ing-side enhanced capabilities such as multi-view feature de-
tection and classification, e.g., determining the pose of a face
within a scene, which would typically require solving a corre-
spondence problem based on the scene geometry.
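The view scalability and backward compatibility requirements can be pictured as a filtering step over coded data that is tagged by view. The sketch below is an abstraction under stated assumptions: the CodedUnit type, the view_id field, and the dependency table are hypothetical names for this example and do not reproduce the actual bitstream syntax of the standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedUnit:
    """Hypothetical container for one piece of coded data of one view."""
    view_id: int
    payload: bytes


def required_views(target_views, inter_view_refs):
    """Return the target views plus every view they depend on, transitively."""
    needed = set()
    stack = list(target_views)
    while stack:
        view = stack.pop()
        if view in needed:
            continue
        needed.add(view)
        stack.extend(inter_view_refs.get(view, ()))
    return needed


def extract_sub_bitstream(units, target_views, inter_view_refs):
    """Keep only the units needed to decode the requested output views."""
    keep = required_views(target_views, inter_view_refs)
    return [unit for unit in units if unit.view_id in keep]


# Hypothetical three-view stream: view 0 is the base view, and views 1
# and 2 are each predicted from view 0.
refs = {0: [], 1: [0], 2: [0]}
units = [
    CodedUnit(0, b"base-pic0"), CodedUnit(1, b"v1-pic0"), CodedUnit(2, b"v2-pic0"),
    CodedUnit(0, b"base-pic1"), CodedUnit(1, b"v1-pic1"), CodedUnit(2, b"v2-pic1"),
]

base_only = extract_sub_bitstream(units, [0], refs)
view2 = extract_sub_bitstream(units, [2], refs)
print([u.view_id for u in base_only])
print([u.view_id for u in view2])
```

Requesting only the base view keeps data of view 0 alone, which mirrors the idea that a single-view decoder can ignore everything else. Requesting view 2 also keeps view 0, since in this assumed structure view 2 is predicted from it.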
Moreover, for ease of implementation, it was highly desir-
able for the MVC design to have as many design elements in
common with an ordinary H.264/MPEG-4 AVC system as
possible. Such a commonality of design components can en-
able an MVC system to be constructed rapidly from elements
of existing H.264/MPEG-4 AVC products and to be tested
more easily.
III. HISTORY OF MVC
One of the earliest studies on coding of multiview images
was done by Lukacs [2]; in this work, the concept of disparity-
compensated inter-view prediction was introduced. In later
work by Dinstein, et al. [3], the predictive coding approach
was compared to 3D block transform coding for stereo image
compression. In [4], Perkins presented a transform-domain
technique for disparity-compensated prediction, as well as a
mixed-resolution coding scheme.
The first support for multiview video coding in an interna-
tional standard was in a 1996 amendment to the
H.262/MPEG-2 video coding standard [6]. It supported the
coding of two views only. In that design, the left view was
referred to as the "base view" and its encoding was compatible
with that for ordinary single-view decoders. The right view

Frequently Asked Questions (8)
Q1. What are the levels of MVC encoders?

Levels impose constraints on the bitstreams produced by MVC encoders, to establish bounds on the necessary decoder resources and complexity. 

For applications in which random access or view switching is important, the prediction structure can be designed to minimize access delay, and the MVC design provides a way for an encoder to describe the prediction structure for this purpose. 

Considering recent advancements in video compression technology and the anticipated needs for state-of-the-art coding of multiview video, MPEG issued a Call for Proposals (CfP) for efficient multiview video coding technology in October of 2005. 

A major consequence of not requiring changes to lower levels of the syntax (at the macroblock level and below it) is that MVC is compatible with existing hardware for decoding single-view video with H.264/MPEG-4 AVC. 

In other studies [50], an average reduction of 20-30% of the bit rate for the second (dependent) view of typical stereo movie content was reported, with a peak reduction for an individual test sequence of 43% of the bit rate of the dependent view. 

There are many types of 3D display systems [14], ranging from classic stereo systems that require special-purpose glasses to more sophisticated multiview autostereoscopic displays that do not require glasses [15].

Prior studies on asymmetrical coding of stereo video, in which one of the views is encoded with lower quality than the other, suggest that a further substantial savings in bit rate for the non-base view could be achieved using that technique. 

Several other aspects of the MVC design were further elaborated on in [44], including random access and view switching, extraction of operation points (sets of coded views at particular levels of a nested temporal referencing structure) of an MVC bitstream for adaptation to network and device constraints, parallel processing, and a description of several newly adopted SEI messages that are relevant for multiview video bitstreams.