

Compression of stereo image pairs and streams
M. W. Siegel (1), Priyan Gunatilake (2), Sriram Sethuraman (2), A. G. Jordan (1,2)
(1) Robotics Institute, School of Computer Science
(2) Department of Electrical and Computer Engineering
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
ABSTRACT
We exploit the correlations between 3D-stereoscopic left-right image pairs to achieve high compression factors for image
frame storage and image stream transmission. In particular, in image stream transmission, we can find extremely high
correlations between left-right frames offset in time such that perspective-induced disparity between viewpoints and motion-
induced parallax from a single viewpoint are nearly identical; we coin the term "WorldLine correlation" for this condition.
We test these ideas in two implementations: (1) straightforward computing of blockwise cross-correlations, and (2)
multiresolution hierarchical matching using a wavelet-based compression method. We find that good 3D-stereoscopic
imagery can be had for only a few percent more storage space or transmission bandwidth than is required for the
corresponding flat imagery.
1. INTRODUCTION
The successful development of compression schemes for motion video that exploit the high correlation between temporally
adjacent frames, e.g., MPEG, suggests that we might analogously exploit the high correlation between spatially or angularly
adjacent still frames, i.e., left-right 3D-stereoscopic image pairs. If left-right pairs are selected from 3D-stereoscopic motion
streams at different times, such that perspective-induced disparity (left-right) and motion-induced disparity (earlier-later)
produce about the same visual effect, then extremely high correlation will exist between the members of these pairs. This
effect, for which we coin the term "WorldLine correlation", can be exploited to achieve extremely high compression factors
for stereo video streams.
Our experiments demonstrate that a reasonable synthesis of one image of a left-right stereo image pair can be estimated from
the other uncompressed or conventionally compressed image augmented by a small set of numbers that describe the local
cross-correlations in terms of a disparity map. When the set is as small (in bits) as 1 to 2% of the conventionally compressed
image, the stereoscopically viewed pair consisting of one original and one synthesized image produces convincing stereo
imagery. Occlusions, for which this approach of course fails, can be handled efficiently by encoding and transmitting error
maps (residuals) of regions where a local statistical operator indicates that an occlusion is probable.
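A minimal sketch of this encoder-side occlusion handling: here the "local statistical operator" is simply the mean absolute prediction error per block, and the threshold is our own assumption; a real implementation would choose both more carefully.

```python
import numpy as np

def occlusion_residuals(target, synthesized, block=8, thresh=10.0):
    """Collect residuals for blocks where disparity-based prediction failed.

    The encoder compares the true view (`target`) against the view
    synthesized from the disparity map; blocks whose mean absolute error
    exceeds `thresh` (probable occlusions) are flagged, and only their
    residuals need to be encoded and transmitted.
    """
    h, w = target.shape
    flagged = {}
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            err = (target[y:y + block, x:x + block].astype(np.float64)
                   - synthesized[y:y + block, x:x + block])
            if np.abs(err).mean() > thresh:
                flagged[(by, bx)] = err  # residual to transmit for this block
    return flagged
```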
Two cross-correlation mapping schemes independently developed by two of us (P.G. and S.S.) have been coded and tested,
extensively on still image pairs and more recently on some motion video streams. Both methods yield comparable
compression factors and visual fidelity; which can be coded more efficiently, and whether either can be coded efficiently
enough to make it practical for real-time use, is under study.

The method developed by P.G. is based on straightforward computing of blockwise cross-correlations; heuristics that direct
the search substantially improve efficiency at the price of occasionally finding a local maximum rather than the global
maximum.
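A sketch of such a blockwise search follows; this is the exhaustive version, without the search-directing heuristics, and the 8x8 block size and ±8-pixel horizontal range are our own illustrative choices.

```python
import numpy as np

def block_disparity(left, right, block=8, search=8):
    """Blockwise horizontal disparity by exhaustive normalized cross-correlation.

    For each block of the right image, find the horizontal offset (within
    +/-search pixels) of the best-matching block in the left image.
    """
    h, w = right.shape
    dmap = np.zeros((h // block, w // block), dtype=np.int32)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            patch = right[y:y + block, x:x + block].astype(np.float64)
            patch -= patch.mean()
            best_score, best_d = -np.inf, 0
            for d in range(-search, search + 1):
                if x + d < 0 or x + d + block > w:
                    continue  # candidate block would fall outside the image
                cand = left[y:y + block, x + d:x + d + block].astype(np.float64)
                cand -= cand.mean()
                denom = np.sqrt((patch ** 2).sum() * (cand ** 2).sum())
                if denom == 0:
                    continue  # flat block: correlation undefined
                score = (patch * cand).sum() / denom
                if score > best_score:  # keep the global maximum over the range
                    best_score, best_d = score, d
            dmap[by, bx] = best_d
    return dmap
```

Restricting the candidate offsets to a window around a neighboring block's disparity is one example of the heuristics mentioned above: it saves most of the work, at the price of occasionally locking onto a local rather than the global maximum.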
The method developed by S.S. is based on multiresolution hierarchical matching using wavelets; efficiency is achieved by
doing the search for the best match down a tree of progressively higher resolution images, starting from a low resolution
highly subsampled image.
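The coarse-to-fine principle can be illustrated with plain 2x2 averaging in place of wavelets, and with a single global offset in place of a dense disparity map; both simplifications are ours, not the method's.

```python
import numpy as np

def pyramid(img, levels):
    """Multiresolution pyramid by 2x2 block averaging, coarsest level first."""
    pyr = [img.astype(np.float64)]
    for _ in range(levels - 1):
        a = pyr[-1]
        pyr.append(0.25 * (a[0::2, 0::2] + a[1::2, 0::2]
                           + a[0::2, 1::2] + a[1::2, 1::2]))
    return pyr[::-1]

def hierarchical_offset(left, right, levels=3, radius=2):
    """Coarse-to-fine estimate of a single global horizontal offset.

    Search exhaustively (+/-radius) only at the coarsest level; at each finer
    level, double the running estimate and refine it within +/-1 pixel, so the
    total work grows only logarithmically with the search range.
    """
    def sad(a, b, d):
        # Mean absolute difference between a and b at horizontal offset d.
        if d >= 0:
            return np.abs(a[:, d:] - b[:, :a.shape[1] - d]).mean()
        return np.abs(a[:, :d] - b[:, -d:]).mean()

    estimate = 0
    for level, (lp, rp) in enumerate(zip(pyramid(left, levels),
                                         pyramid(right, levels))):
        if level == 0:
            candidates = range(-radius, radius + 1)
        else:
            estimate *= 2
            candidates = range(estimate - 1, estimate + 2)
        estimate = min(candidates, key=lambda d: sad(lp, rp, d))
    return estimate
```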
In the following sections we discuss the need and opportunity for compression of 3D-stereoscopic imagery, discuss the
correlations that can be exploited to achieve compression, describe and refine the approach, summarize the content and
performance of the two implementations we have prototyped to date, and outline several topics we have targeted for ongoing
research.
This paper is intended as a high level introduction to our thoughts about and our progress toward compression for 3D-
stereoscopy. The specific references that we cite in the text and the general references that we also include in the
bibliography point to background literature, as well as to three recent papers [1, 2, 3] in which we document the low level
details of our recent work.
2. NEED AND OPPORTUNITY
The scenario we imagine is that binocular 3D-stereoscopy is grafted onto "flat" (monoscopic) display infrastructures; we
regard the alternative scenario, that 3D-stereoscopy is built into the foundations of the infrastructure, as being somewhat
farfetched in light of the cost and effectiveness of the current generation of 3D display devices and systems.
Displays rapidly become more expensive as their spatial resolution and temporal frame rate increase. Thus in any
application the display is usually chosen to meet but not to exceed substantially the application’s requirements. In flat
applications each eye sees, at no cost to the other eye, the full spatial and temporal bandwidth that the display delivers. When
a 3D-stereoscopic application is grafted onto a flat infrastructure the display’s capabilities must be divided between the two
eyes. The price may be extracted in either essentially the spatial domain, e.g., by assigning the odd lines to the left eye and
the even lines to the right eye, or in essentially the temporal domain, e.g., by assigning alternate frames to the left and right
eye. The distinction is in part semantic, since the "spatial" method of this example is often implemented in practice via
sequential fields in an interlaced display system. The fundamental issue is that when 3D-stereoscopy is implemented on a
single display each eye gets in some sense only half the display. A user contemplating using 3D-stereoscopy must thus
acquire a display (and the underlying system to support it) with twice the pixel-per-second capability of the minimal display
needed for the flat application; the alternatives require choosing between a flickering image and a reduced spatial resolution
image.
As indicated, lower level capacities of the system’s components must also be doubled. In particular, all the information
captured by two cameras (each equivalent to the original camera) must be stored or transmitted or both. Doubling these
capacities may be more difficult than doubling the capability of the display, inasmuch as (except at the very high end) the
capability of the display can be increased by simply paying more. The most difficult system component to increase is
probably the bandwidth of the transmission system, which is often subject to powerful regulatory as well as technical

constraints. Nevertheless, the bandwidth must apparently be doubled to transmit 3D-stereoscopic image streams at the same
spatial resolution and temporal update frequency as either flat image stream.
In fact, because the two views comprising a 3D-stereoscopic image pair are nearly identical, i.e., the information content of
both together is only a little more than the information content of one alone, it is possible to find representations of image
pairs and streams that take up little more storage space and transmission bandwidth than the space or bandwidth that is
required by either alone. The rest of this paper is devoted to an overview of how this can be done, some details of our early
implementations, and a discussion of possibilities for the future.
2.1. Background
We remind the reader that image compression methods fall into two broad categories, "lossless" and "lossy". Lossless
compression exploits the existence of redundant or repeated information, storing the image in less space by symbolically
rather than explicitly repeating information, and by related methods such as assigning the shortest codes to the most probable
occurrences. Lossy compression exploits characteristics of the human visual system by discarding image content that is
known to have little or no impact on human perception of the image.
Our approach to compression of 3D-stereoscopic imagery has two components, related to there being two perspective views
in a 3D-stereoscopic pair. One component may be either lossless or slightly lossy, as in conventional compression of flat
imagery; the other component is by itself a very lossy (or "deep") method of compression. The intimate connection between
the two views makes it possible to synthesize a perceptually acceptable image from a compression so deep that, by itself, it
would be incomprehensible.
The left and right views that comprise a 3D-stereoscopic image pair or motion stream pair are obviously very similar. There
are various ways of saying this: they are often described as "highly redundant", in that most of the information contained in
either is repeated in the other, or as "highly correlated" in that either is for the most part easily predicted from the other by
application of some external information about the relationship (the relative perspective) between them. We can thus
synthesize a reasonable approximation to either view given the other view and a little additional information that describes
the relationship between the two views. A useful form for the additional information is a disparity map: a two dimensional
vector field that encodes how to displace blocks of pixels in one view to approximate the other view.
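Given one view and such a map, synthesis is block-by-block copying at the mapped offsets. A sketch, with horizontal-only disparities and 8x8 blocks as our own simplifying assumptions:

```python
import numpy as np

def synthesize_view(ref, dmap, block=8):
    """Synthesize the other view of a stereo pair from one view and a
    blockwise disparity map (here horizontal-only, for brevity).

    Each output block is copied from `ref` at the offset stored in the map;
    offsets are clipped at the image border. Occluded regions, which no
    offset can predict, are left to the residual-encoding step.
    """
    h, w = ref.shape
    out = np.empty_like(ref)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            xs = int(np.clip(x + dmap[by, bx], 0, w - block))
            out[y:y + block, x:x + block] = ref[y:y + block, xs:xs + block]
    return out
```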
Fortunately a "reasonable approximation" is enough: perfection is not required. This is the case because of two
psychophysical effects, one well known, the other less so.
It is well known that one good eye and one bad eye together are better than the good eye alone, i.e., the information they
provide in a sense adds rather than averages. The resulting perception is sharper than the perception provided by the better
eye alone. Thus when one eye is presented with the original view intended for it, and the other eye with a synthetic view
(which might be imperfect in sharpness and perhaps even missing some small features), the perception of both together is
better than the perception of the original view alone.
A related perceptual effect that we have observed informally has been documented in several controlled experiments: a
binocular 3D-stereoscopic image pair with one sharp member and one blurred member successfully stimulates appropriate
depth perception.

Thus we expect that if one member of a 3D-stereoscopic image pair is losslessly or nearly losslessly compressed and the
other is (by some appropriate method) deeply compressed, the pair of decompressed (higher resolution) and synthesized
(lower resolution) views will together be perceived comfortably and accurately.
In the following section we describe several approaches to compression, ultimately focusing on the method we are now
developing along two complementary implementation paths.
2.2. Correlations
We identify four kinds of correlations or redundancies that can be exploited to compress 3D-stereoscopic imagery. The first
two make no specific reference to 3D-stereoscopy; they are conventional image compression methods that might
(inefficiently!) be applied to two 3D-stereoscopic views independently. The third kind applies to still image pairs, or to
temporally corresponding members of a motion stream pair. The fourth kind, which is really a combination of the second
and third kinds, applies to motion stream pairs.
Spatial correlation: Within a single frame, large areas with little variation in intensity and color permit efficient
encoding based on internal predictability, i.e., the fact that any given pixel is most likely to be identical or nearly
identical to its neighbors. This is the basis for most conventional still image compression methods.
Temporal correlation: Between frames in a motion sequence, large areas in rigid-body motion permit efficient
coding based on frame-to-frame predictability. The approach is fundamentally to transmit an occasional frame,
and interpolation coefficients that permit the receiver to synthesize reasonable approximations to the
intermediate frames. MPEG is an example.
Perspective correlation: Between frames in a binocular 3D-stereoscopic image pair, large areas differing only by
small horizontal offsets permit efficient coding based on disparity predictability. If one imagines the two
perspective views as being gathered not simultaneously but rather sequentially by moving the camera from one
viewpoint to the second, then perspective correlation and temporal correlation are to first order equivalent.
WorldLine correlation: We borrow the term "worldline" from the Theory of Special Relativity, where the
worldline is a central concept that refers to the path of an object in 4-dimensional space-time. Observers moving
relative to each other, i.e., observers having different perspectives on space-time, perceive a worldline segment
as having different spatial and temporal components, but they all agree on the length of the segment.
Analogously in 3D-stereoscopic image streams, when vertical and axial velocities are small and horizontal
motion suitably compensates perspective, time-offset frames in the left and right image streams can be nearly
identical. WorldLine correlation is the combination of temporal correlation and perspective correlation; the most
interesting manifestation of WorldLine correlation is the potential near-identity of appropriately time-offset
frames in the left and right image streams respectively.* The concept is useful for situations in which the camera
is fixed and parts of the scene are in motion, the scene is fixed and the camera is in motion, and both the camera
and parts of the scene are in motion.
WorldLine correlation is depicted pictorially in Figure 1.
*Thinking in a suitable generalized Fourier domain, simultaneous pairs from different perspectives and pairs from one perspective at different times are
characterized by nearly identical amplitude spectra but substantially (although systematically) different phase spectra.
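The footnote's claim is an instance of the Fourier shift theorem: a pure displacement leaves the amplitude spectrum unchanged and shows up only in the phase. A one-dimensional check:

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=256)        # an arbitrary 1-D "scanline"
shifted = np.roll(signal, 5)         # the same scanline, displaced 5 samples

spec = np.fft.fft(signal)
spec_shifted = np.fft.fft(shifted)

# Amplitude spectra are identical; only the phase spectra differ.
assert np.allclose(np.abs(spec), np.abs(spec_shifted))
assert not np.allclose(np.angle(spec), np.angle(spec_shifted))
```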

[Figure 1: Pictorial depiction of WorldLine correlation. The frames "left now", "right now", and "right later" are shown; the simultaneous left-right pair is mutually predictable, and the suitably time-offset pair ("left now" and "right later") is almost identical.]
3. APPROACH
3.1. Basic Approach
Our basic approach to compression of 3D-stereoscopic imagery is based on the observation that disparity, the relative offset
between corresponding points in an image pair, varies only slowly over most of the image field. Given the validity of this
assumption, either member of an image pair can be synthesized (or "predicted") given the other member and a low-
resolution map of the relative disparity between the two members of the pair. It is the possibility that the disparity map can
be low resolution, combined with the fact that the disparities vary slowly and can be represented by small numbers (few bits)
that permits deep compression.
As a numerical example, suppose that over most of the image field the disparity does not change significantly over eight
pixels. Then a disparity map can be represented by a field with 1/64 the number of entries as the image itself. Each disparity
is a vector with two components, horizontal and vertical, so the net compression has an upper bound of 1/32, about 3%. In
fact further significant advantages can be obtained by recognizing that the disparity components can be encoded with fewer
bits than the original intensities, e.g., perhaps three bits for the vertical disparities (four pixels up or down) and perhaps five
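The arithmetic of this numerical example can be checked mechanically; the frame size and 8-bit pixel depth below are our own illustrative choices.

```python
# Disparity sampled every 8 pixels -> 1/64 as many entries as pixels;
# two components per entry at full pixel bit depth -> at most 1/32 (~3%).
pixels = 640 * 480            # illustrative frame size (assumption)
bits_per_pixel = 8
image_bits = pixels * bits_per_pixel

block = 8                     # disparity roughly constant over 8-pixel spans
entries = pixels // (block * block)            # 1/64 of the pixel count
map_bits_upper = entries * 2 * bits_per_pixel  # 2 components, 8 bits each

assert map_bits_upper / image_bits == 1 / 32   # the ~3% upper bound

# Shorter codes per component (e.g. 3 vertical + 5 horizontal bits,
# following the text's example) halve the cost again:
map_bits = entries * (3 + 5)
assert map_bits / image_bits == 1 / 64
```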

References

Wavelets and signal processing (journal article).
Data compression of stereopairs (journal article).
Interpolative multiresolution coding of advance television with compatible subchannels (journal article).
On stereo image coding (proceedings article).
Constrained disparity and motion estimators for 3DTV image sequence coding (journal article).