

Compression of stereo image pairs and streams
M. W. Siegel (1), Priyan Gunatilake (2), Sriram Sethuraman (2), A. G. Jordan (1,2)
(1) Robotics Institute, School of Computer Science
(2) Department of Electrical and Computer Engineering
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
ABSTRACT
We exploit the correlations between 3D-stereoscopic left-right image pairs to achieve high compression factors for image
frame storage and image stream transmission. In particular, in image stream transmission, we can find extremely high
correlations between left-right frames offset in time such that perspective-induced disparity between viewpoints and motion-
induced parallax from a single viewpoint are nearly identical; we coin the term "WorldLine correlation" for this condition.
We test these ideas in two implementations: (1) straightforward computing of blockwise cross-correlations, and (2)
multiresolution hierarchical matching using a wavelet-based compression method. We find that good 3D-stereoscopic
imagery can be had for only a few percent more storage space or transmission bandwidth than is required for the
corresponding flat imagery.
1. INTRODUCTION
The successful development of compression schemes for motion video that exploit the high correlation between temporally
adjacent frames, e.g., MPEG, suggests that we might analogously exploit the high correlation between spatially or angularly
adjacent still frames, i.e., left-right 3D-stereoscopic image pairs. If left-right pairs are selected from 3D-stereoscopic motion
streams at different times, such that perspective-induced disparity (left-right) and motion-induced disparity (earlier-later)
produce about the same visual effect, then extremely high correlation will exist between the members of these pairs. This
effect, for which we coin the term "WorldLine correlation", can be exploited to achieve extremely high compression factors
for stereo video streams.
Our experiments demonstrate that a reasonable synthesis of one image of a left-right stereo image pair can be estimated from
the other uncompressed or conventionally compressed image augmented by a small set of numbers that describe the local
cross-correlations in terms of a disparity map. When the set is as small (in bits) as 1 to 2% of the conventionally compressed
image, the stereoscopically viewed pair consisting of one original and one synthesized image produces convincing stereo
imagery. Occlusions, for which this approach of course fails, can be handled efficiently by encoding and transmitting error
maps (residuals) of regions where a local statistical operator indicates that an occlusion is probable.
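A minimal sketch of this encoder-side occlusion handling: here the "local statistical operator" is simply the mean absolute prediction error per block, and the threshold is our own assumption; a real implementation would choose both more carefully.

```python
import numpy as np

def occlusion_residuals(target, synthesized, block=8, thresh=10.0):
    """Collect residuals for blocks where disparity-based prediction failed.

    The encoder compares the true view (`target`) against the view
    synthesized from the disparity map; blocks whose mean absolute error
    exceeds `thresh` (probable occlusions) are flagged, and only their
    residuals need to be encoded and transmitted.
    """
    h, w = target.shape
    flagged = {}
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            err = (target[y:y + block, x:x + block].astype(np.float64)
                   - synthesized[y:y + block, x:x + block])
            if np.abs(err).mean() > thresh:
                flagged[(by, bx)] = err  # residual to transmit for this block
    return flagged
```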
Two cross-correlation mapping schemes independently developed by two of us (P.G. and S.S.) have been coded and tested,
extensively on still image pairs and more recently on some motion video streams. Both methods yield comparable
compression factors and visual fidelity; which can be coded more efficiently, and whether either can be coded efficiently
enough to make it practical for real-time use, is under study.

The method developed by P.G. is based on straightforward computing of blockwise cross-correlations; heuristics that direct
the search substantially improve efficiency at the price of occasionally finding a local maximum rather than the global
maximum.
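A sketch of such a blockwise search follows; this is the exhaustive version, without the search-directing heuristics, and the 8x8 block size and ±8-pixel horizontal range are our own illustrative choices.

```python
import numpy as np

def block_disparity(left, right, block=8, search=8):
    """Blockwise horizontal disparity by exhaustive normalized cross-correlation.

    For each block of the right image, find the horizontal offset (within
    +/-search pixels) of the best-matching block in the left image.
    """
    h, w = right.shape
    dmap = np.zeros((h // block, w // block), dtype=np.int32)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            patch = right[y:y + block, x:x + block].astype(np.float64)
            patch -= patch.mean()
            best_score, best_d = -np.inf, 0
            for d in range(-search, search + 1):
                if x + d < 0 or x + d + block > w:
                    continue  # candidate block would fall outside the image
                cand = left[y:y + block, x + d:x + d + block].astype(np.float64)
                cand -= cand.mean()
                denom = np.sqrt((patch ** 2).sum() * (cand ** 2).sum())
                if denom == 0:
                    continue  # flat block: correlation undefined
                score = (patch * cand).sum() / denom
                if score > best_score:  # keep the global maximum over the range
                    best_score, best_d = score, d
            dmap[by, bx] = best_d
    return dmap
```

Restricting the candidate offsets to a window around a neighboring block's disparity is one example of the heuristics mentioned above: it saves most of the work, at the price of occasionally locking onto a local rather than the global maximum.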
The method developed by S.S. is based on multiresolution hierarchical matching using wavelets; efficiency is achieved by
doing the search for the best match down a tree of progressively higher resolution images, starting from a low resolution
highly subsampled image.
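The coarse-to-fine principle can be illustrated with plain 2x2 averaging in place of wavelets, and with a single global offset in place of a dense disparity map; both simplifications are ours, not the method's.

```python
import numpy as np

def pyramid(img, levels):
    """Multiresolution pyramid by 2x2 block averaging, coarsest level first."""
    pyr = [img.astype(np.float64)]
    for _ in range(levels - 1):
        a = pyr[-1]
        pyr.append(0.25 * (a[0::2, 0::2] + a[1::2, 0::2]
                           + a[0::2, 1::2] + a[1::2, 1::2]))
    return pyr[::-1]

def hierarchical_offset(left, right, levels=3, radius=2):
    """Coarse-to-fine estimate of a single global horizontal offset.

    Search exhaustively (+/-radius) only at the coarsest level; at each finer
    level, double the running estimate and refine it within +/-1 pixel, so the
    total work grows only logarithmically with the search range.
    """
    def sad(a, b, d):
        # Mean absolute difference between a and b at horizontal offset d.
        if d >= 0:
            return np.abs(a[:, d:] - b[:, :a.shape[1] - d]).mean()
        return np.abs(a[:, :d] - b[:, -d:]).mean()

    estimate = 0
    for level, (lp, rp) in enumerate(zip(pyramid(left, levels),
                                         pyramid(right, levels))):
        if level == 0:
            candidates = range(-radius, radius + 1)
        else:
            estimate *= 2
            candidates = range(estimate - 1, estimate + 2)
        estimate = min(candidates, key=lambda d: sad(lp, rp, d))
    return estimate
```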
In the following sections we discuss the need and opportunity for compression of 3D-stereoscopic imagery, discuss the
correlations that can be exploited to achieve compression, describe and refine the approach, summarize the content and
performance of the two implementations we have prototyped to date, and outline several topics we have targeted for ongoing
research.
This paper is intended as a high level introduction to our thoughts about and our progress toward compression for 3D-
stereoscopy. The specific references that we cite in the text and the general references that we also include in the
bibliography point to background literature, as well as to three recent papers [1, 2, 3] in which we document the low level
details of our recent work.
2. NEED AND OPPORTUNITY
The scenario we imagine is that binocular 3D-stereoscopy is grafted onto "flat" (monoscopic) display infrastructures; we
regard the alternative scenario, that 3D-stereoscopy is built into the foundations of the infrastructure, as being somewhat
farfetched in light of the cost and effectiveness of the current generation of 3D display devices and systems.
Displays rapidly become more expensive as their spatial resolution and temporal frame rate increase. Thus in any
application the display is usually chosen to meet but not to exceed substantially the application’s requirements. In flat
applications each eye sees, at no cost to the other eye, the full spatial and temporal bandwidth that the display delivers. When
a 3D-stereoscopic application is grafted onto a flat infrastructure the display’s capabilities must be divided between the two
eyes. The price may be extracted in either essentially the spatial domain, e.g., by assigning the odd lines to the left eye and
the even lines to the right eye, or in essentially the temporal domain, e.g., by assigning alternate frames to the left and right
eye. The distinction is in part semantic, since the "spatial" method of this example is often implemented in practice via
sequential fields in an interlaced display system. The fundamental issue is that when 3D-stereoscopy is implemented on a
single display each eye gets in some sense only half the display. A user contemplating using 3D-stereoscopy must thus
acquire a display (and the underlying system to support it) with twice the pixel-per-second capability of the minimal display
needed for the flat application; the alternatives require choosing between a flickering image and a reduced spatial resolution
image.
As indicated, lower level capacities of the system’s components must also be doubled. In particular, all the information
captured by two cameras (each equivalent to the original camera) must be stored or transmitted or both. Doubling these
capacities may be more difficult than doubling the capability of the display, inasmuch as (except at the very high end) the
capability of the display can be increased by simply paying more. The most difficult system component to increase is
probably the bandwidth of the transmission system, which is often subject to powerful regulatory as well as technical

constraints. Nevertheless, the bandwidth must apparently be doubled to transmit 3D-stereoscopic image streams at the same
spatial resolution and temporal update frequency as either flat image stream.
In fact, because the two views comprising a 3D-stereoscopic image pair are nearly identical, i.e., the information content of
both together is only a little more than the information content of one alone, it is possible to find representations of image
pairs and streams that take up little more storage space and transmission bandwidth than the space or bandwidth that is
required by either alone. The rest of this paper is devoted to an overview of how this can be done, some details of our early
implementations, and a discussion of possibilities for the future.
2.1. Background
We remind the reader that image compression methods fall into two broad categories, "lossless" and "lossy". Lossless
compression exploits the existence of redundant or repeated information, storing the image in less space by symbolically
rather than explicitly repeating information, and by related methods such as assigning the shortest codes to the most probable
occurrences. Lossy compression exploits characteristics of the human visual system by discarding image content that is
known to have little or no impact on human perception of the image.
Our approach to compression of 3D-stereoscopic imagery has two components, related to there being two perspective views
in a 3D-stereoscopic pair. One component may be either lossless or slightly lossy, as in conventional compression of flat
imagery; the other component is by itself a very lossy (or "deep") method of compression. The intimate connection between
the two views makes it possible to synthesize a perceptually acceptable image from a compression so deep that, by itself, it
would be incomprehensible.
The left and right views that comprise a 3D-stereoscopic image pair or motion stream pair are obviously very similar. There
are various ways of saying this: they are often described as "highly redundant", in that most of the information contained in
either is repeated in the other, or as "highly correlated" in that either is for the most part easily predicted from the other by
application of some external information about the relationship (the relative perspective) between them. We can thus
synthesize a reasonable approximation to either view given the other view and a little additional information that describes
the relationship between the two views. A useful form for the additional information is a disparity map: a two dimensional
vector field that encodes how to displace blocks of pixels in one view to approximate the other view.
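Given one view and such a map, synthesis is block-by-block copying at the mapped offsets. A sketch, with horizontal-only disparities and 8x8 blocks as our own simplifying assumptions:

```python
import numpy as np

def synthesize_view(ref, dmap, block=8):
    """Synthesize the other view of a stereo pair from one view and a
    blockwise disparity map (here horizontal-only, for brevity).

    Each output block is copied from `ref` at the offset stored in the map;
    offsets are clipped at the image border. Occluded regions, which no
    offset can predict, are left to the residual-encoding step.
    """
    h, w = ref.shape
    out = np.empty_like(ref)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            xs = int(np.clip(x + dmap[by, bx], 0, w - block))
            out[y:y + block, x:x + block] = ref[y:y + block, xs:xs + block]
    return out
```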
Fortunately a "reasonable approximation" is enough: perfection is not required. This is the case because of two
psychophysical effects, one well known, the other less so.
It is well known that one good eye and one bad eye together are better than the good eye alone, i.e., the information they
provide in a sense adds rather than averages. The resulting perception is sharper than the perception provided by the better
eye alone. Thus when one eye is presented with the original view intended for it, and the other eye with a synthetic view
(which might be imperfect in sharpness and perhaps even missing some small features), the perception of both together is
better than the perception of the original view alone.
A related perceptual effect that we have observed informally has been documented in several controlled experiments: a
binocular 3D-stereoscopic image pair with one sharp member and one blurred member successfully stimulates appropriate
depth perception.

Thus we expect that if one member of a 3D-stereoscopic image pair is losslessly or nearly losslessly compressed and the
other is (by some appropriate method) deeply compressed, the pair of decompressed (higher resolution) and synthesized
(lower resolution) views will together be perceived comfortably and accurately.
In the following section we describe several approaches to compression, ultimately focusing on the method we are now
developing along two complementary implementation paths.
2.2. Correlations
We identify four kinds of correlations or redundancies that can be exploited to compress 3D-stereoscopic imagery. The first
two make no specific reference to 3D-stereoscopy; they are conventional image compression methods that might
(inefficiently!) be applied to two 3D-stereoscopic views independently. The third kind applies to still image pairs, or to
temporally corresponding members of a motion stream pair. The fourth kind, which is really a combination of the second
and third kinds, applies to motion stream pairs.
Spatial correlation: Within a single frame, large areas with little variation in intensity and color permit efficient
encoding based on internal predictability, i.e., the fact that any given pixel is most likely to be identical or nearly
identical to its neighbors. This is the basis for most conventional still image compression methods.
Temporal correlation: Between frames in a motion sequence, large areas in rigid-body motion permit efficient
coding based on frame-to-frame predictability. The approach is fundamentally to transmit an occasional frame,
and interpolation coefficients that permit the receiver to synthesize reasonable approximations to the
intermediate frames. MPEG is an example.
Perspective correlation: Between frames in a binocular 3D-stereoscopic image pair, large areas differing only by
small horizontal offsets permit efficient coding based on disparity predictability. If one imagines the two
perspective views as being gathered not simultaneously but rather sequentially by moving the camera from one
viewpoint to the second, then perspective correlation and temporal correlation are to first order equivalent.
WorldLine correlation: We borrow the term "worldline" from the Theory of Special Relativity, where the
worldline is a central concept that refers to the path of an object in 4-dimensional space-time. Observers moving
relative to each other, i.e., observers having different perspectives on space-time, perceive a worldline segment
as having different spatial and temporal components, but they all agree on the length of the segment.
Analogously in 3D-stereoscopic image streams, when vertical and axial velocities are small and horizontal
motion suitably compensates perspective, time-offset frames in the left and right image streams can be nearly
identical. WorldLine correlation is the combination of temporal correlation and perspective correlation; the most
interesting manifestation of WorldLine correlation is the potential near-identity of appropriately time-offset
frames in the left and right image streams respectively.* The concept is useful for situations in which the camera
is fixed and parts of the scene are in motion, the scene is fixed and the camera is in motion, and both the camera
and parts of the scene are in motion.
WorldLine correlation is depicted pictorially in Figure 1.
*Thinking in a suitable generalized Fourier domain, simultaneous pairs from different perspectives and pairs from one perspective at different times are
characterized by nearly identical amplitude spectra but substantially (although systematically) different phase spectra.
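The footnote's claim is an instance of the Fourier shift theorem: a pure displacement leaves the amplitude spectrum unchanged and shows up only in the phase. A one-dimensional check:

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=256)        # an arbitrary 1-D "scanline"
shifted = np.roll(signal, 5)         # the same scanline, displaced 5 samples

spec = np.fft.fft(signal)
spec_shifted = np.fft.fft(shifted)

# Amplitude spectra are identical; only the phase spectra differ.
assert np.allclose(np.abs(spec), np.abs(spec_shifted))
assert not np.allclose(np.angle(spec), np.angle(spec_shifted))
```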

[Figure 1: Pictorial depiction of WorldLine correlation. The frames "left now", "right now", and "right later" are shown; the simultaneous left-right pair is mutually predictable, and the suitably time-offset pair ("left now" and "right later") is almost identical.]
3. APPROACH
3.1. Basic Approach
Our basic approach to compression of 3D-stereoscopic imagery is based on the observation that disparity, the relative offset
between corresponding points in an image pair, varies only slowly over most of the image field. Given the validity of this
assumption, either member of an image pair can be synthesized (or "predicted") given the other member and a low-
resolution map of the relative disparity between the two members of the pair. It is the possibility that the disparity map can
be low resolution, combined with the fact that the disparities vary slowly and can be represented by small numbers (few bits)
that permits deep compression.
As a numerical example, suppose that over most of the image field the disparity does not change significantly over eight
pixels. Then a disparity map can be represented by a field with 1/64 the number of entries as the image itself. Each disparity
is a vector with two components, horizontal and vertical, so the net compression has an upper bound of 1/32, about 3%. In
fact further significant advantages can be obtained by recognizing that the disparity components can be encoded with fewer
bits than the original intensities, e.g., perhaps three bits for the vertical disparities (four pixels up or down) and perhaps five
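The arithmetic of this numerical example can be checked mechanically; the frame size and 8-bit pixel depth below are our own illustrative choices.

```python
# Disparity sampled every 8 pixels -> 1/64 as many entries as pixels;
# two components per entry at full pixel bit depth -> at most 1/32 (~3%).
pixels = 640 * 480            # illustrative frame size (assumption)
bits_per_pixel = 8
image_bits = pixels * bits_per_pixel

block = 8                     # disparity roughly constant over 8-pixel spans
entries = pixels // (block * block)            # 1/64 of the pixel count
map_bits_upper = entries * 2 * bits_per_pixel  # 2 components, 8 bits each

assert map_bits_upper / image_bits == 1 / 32   # the ~3% upper bound

# Shorter codes per component (e.g. 3 vertical + 5 horizontal bits,
# following the text's example) halve the cost again:
map_bits = entries * (3 + 5)
assert map_bits / image_bits == 1 / 64
```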

References

Wavelets and signal processing (journal article).
Data compression of stereopairs (journal article).
Interpolative multiresolution coding of advance television with compatible subchannels (journal article).
On stereo image coding (proceedings article).
Constrained disparity and motion estimators for 3DTV image sequence coding (journal article).