Foveated Image and Video Coding
Zhou Wang and Alan C. Bovik
The human visual system (HVS) is highly space-variant in sampling, coding,
processing, and understanding of visual information. The visual sensitivity is high-
est at the point of fixation and decreases dramatically with distance from the point
of fixation. By taking advantage of this phenomenon, foveated image and video
coding systems achieve increased compression efficiency by removing considerable
high-frequency information redundancy from the regions away from the fixation
point without significant loss in perceived image or video quality.
This chapter has three major purposes. The first is to introduce the back-
ground of the foveation feature of the HVS that motivates the research effort of
foveated image processing. The second is to review various foveation techniques
that have been used to construct image and video coding systems. The third is
to describe in more detail a specific example of such a system, which delivers rate
scalable codestreams ordered according to foveation-based perceptual importance,
and has a wide range of potential applications such as video communications over
heterogeneous, time-varying, multi-user and interactive networks.
1.1 Foveated Human Vision and Foveated Image
Processing
Let us start by looking at the anatomy of the human eye. A simplified structure
is illustrated in Figure 1.1. The light that passes through the optics of the eye is
projected onto the retina and sampled by the photoreceptors in the retina. The
retina has two major types of photoreceptors, known as cones and rods.

Chapter 14 in Digital Video Image Quality and Perceptual Coding (H. R. Wu and K. R. Rao, eds.), Marcel Dekker Series in Signal Processing and Communications, Nov. 2005.

Figure 1.1: Structure of the human eye (showing the cornea, pupil, lens, retina, fovea, and optic nerve).

The rods support achromatic vision at low illumination levels and the cone receptors are
responsible for daylight vision. The cones and rods are non-uniformly distributed
over the surface of the retina [1, 2]. The region of highest visual acuity is the fovea,
which contains no rods but has the highest concentration of cones, approximately 50,000 in total [2]. Figure 1.2 shows the variation of photoreceptor density with retinal eccentricity, which is defined as the visual angle (in degrees) between the
fovea and the location of the photoreceptor. The density of the cone cells is highest
at zero eccentricity (the fovea) and drops rapidly with increasing eccentricity. The
photoreceptors deliver data to the plexiform layers of the retina, which provide
both direct and inter-connections from the photoreceptors to the ganglion cells.
The distribution of ganglion cells is also highly non-uniform as shown in Figure
1.2. The density of the ganglion cells drops even faster than the density of the
cone receptors. The receptive fields of the ganglion cells also vary with eccentricity
[1, 2].
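The falloff of sampling density with eccentricity is often summarized by a simple inverse-linear model. The sketch below is an illustrative approximation only: the function name, the model form, and the half-resolution eccentricity of 2.3 degrees are assumptions for the sake of the example, not the measured curves of Figure 1.2.

```python
import numpy as np

def relative_density(eccentricity_deg, e2=2.3):
    """Relative sampling density versus retinal eccentricity.

    Inverse-linear approximation d(e) = e2 / (e2 + e), where e2 (the
    "half-resolution eccentricity", assumed to be 2.3 degrees here) is
    the eccentricity at which density falls to half its foveal value.
    """
    e = np.asarray(eccentricity_deg, dtype=float)
    return e2 / (e2 + e)

# Density is 1 at the fovea (e = 0) and halves at e = e2:
print(float(relative_density(0.0)))   # 1.0
print(float(relative_density(2.3)))   # 0.5
```

Such a model captures the qualitative behavior of Figure 1.2: a sharp peak at zero eccentricity followed by a rapid, monotonic decline.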
The density distributions of cone receptors and ganglion cells play important
roles in determining the ability of our eyes to resolve what we see. When a
human observer gazes at a point in a real-world image, a variable resolution image
is transmitted through the front visual channel into the information processing
units in the human brain. The region around the point of fixation (or foveation
point) is projected onto the fovea, sampled with the highest density, and perceived
by the observer with the highest contrast sensitivity. The sampling density and the
contrast sensitivity decrease dramatically with increasing eccentricity. An example
is shown in Figure 1.3, where Figure 1.3(a) is the original “Goldhill” image and
Figure 1.3(b) is a foveated version of that image. At a certain viewing distance, if attention is focused on the man in the lower part of the image, the foveated and original images are almost indistinguishable.
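Producing a foveated image such as Figure 1.3(b) starts from a per-pixel eccentricity map for the chosen fixation point. The sketch below assumes a simplified flat-screen geometry with the viewing distance expressed in image widths; the function name and the default distance of 3 image widths are illustrative assumptions, not calibrated display parameters.

```python
import numpy as np

def eccentricity_map(width, height, fix_x, fix_y, view_dist=3.0):
    """Per-pixel retinal eccentricity (in degrees) for a fixation point.

    view_dist is the viewing distance measured in image widths, so a
    distance of d pixels from fixation subtends roughly
    atan(d / (width * view_dist)) of visual angle.
    """
    xs = np.arange(width) - fix_x                       # horizontal offsets
    ys = np.arange(height) - fix_y                      # vertical offsets
    d = np.hypot(xs[np.newaxis, :], ys[:, np.newaxis])  # distance in pixels
    return np.degrees(np.arctan(d / (width * view_dist)))

# Fixation on the lower part of a 512x512 image:
ecc = eccentricity_map(512, 512, fix_x=256, fix_y=400)
print(float(ecc[400, 256]))  # 0.0 (zero eccentricity at the fixation point)
```

Combined with a sensitivity model like the one above, such a map determines how much resolution each region of the image can lose without visible effect at the assumed viewing distance.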
Despite the highly space-variant sampling and processing features of the HVS,
traditional digital image processing and computer vision systems represent images
on uniformly sampled rectangular lattices, which have the advantages of simple
acquisition, storage, indexing, and computation.

Figure 1.2: Photoreceptor and ganglion cell density versus retinal eccentricity (from [1]).

Nowadays, most digital images and video sequences are stored, processed, transmitted, and displayed in rectangular matrix format, in which each entry represents one sampling point. In recent
years, there has been growing interest in research work on foveated image process-
ing [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46], which
is targeted at a number of application fields. Significant examples include image
quality assessment [33, 38], image segmentation [24], stereo 3D scene perception
[22], volume data visualization [9], object tracking [25], and image watermarking
[42]. Nevertheless, the majority of research has been focused on foveated image and
video coding, communication and related issues. The major motivation is that con-
siderable high frequency information redundancy exists in the peripheral regions,
so that more efficient image compression can be obtained by removing or reducing
such information redundancy. As a result, the bandwidth required to transmit the
image and video information over communication channels is significantly reduced.
Foveation techniques also supply some additional benefits in visual communica-
tions. For example, in noisy communication environments, foveation provides a
natural way for unequal error-protection of different spatial regions in the image
and video streams being transmitted. Such an error-resilient coding scheme has
been shown to be more robust than protecting all the image regions equally [27, 46]. For
another example, in an interactive multi-point communication environment where
information about the foveated regions at the terminals of the communication net-
works is available, higher perceptual quality images can be achieved by applying
foveated coding techniques [40].
Figure 1.3: Sample foveated image. (a) original “Goldhill” image; (b) foveated “Goldhill” image.

Perfect foveation of discretely-sampled images with smoothly varying resolution turns out to be a difficult theoretical as well as implementation problem. In the
next section, we review various practical foveation techniques that approximate
perfect foveation. Section 1.3 discusses a continuously rate-scalable foveated image
and video coding system that has a number of good features in favor of network
visual communications.
1.2 Foveation Methods
The foveation approaches proposed in the literature may be roughly classified into three categories: geometric methods, filtering-based methods, and multiresolution methods. These categories are closely related, and the third may be viewed as a combination of the first two.
1.2.1 Geometric Methods
The general idea of the geometric methods is to make use of the foveated retinal
sampling geometry. We wish to associate such a highly non-uniform sampling ge-
ometry with a spatially-adaptive coordinate transform, which we call the foveation
coordinate transform. When the transform is applied to the non-uniform retinal
sampling points, uniform sampling density is obtained in the new coordinate sys-
tem. A typically used solution is the logmap transform [13] defined as
w = log(z + a) , (1.1)
where a is a constant, and z and w are complex numbers representing the positions in the original coordinate and the transformed coordinate, respectively. While the logmap transform is empirical, it is shown in [34] that precise mathematical solutions of the foveation coordinate transforms may be derived directly from given retinal sampling distributions.

Figure 1.4: Application of foveation coordinate transform to images. (a) original image; (b) transformed image.
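A minimal sketch of the logmap transform of Eq. (1.1) applied to pixel offsets is given below; the choice a = 1 and the use of raw pixel coordinates in place of calibrated visual angles are illustrative assumptions.

```python
import numpy as np

def logmap(x, y, a=1.0):
    """Logmap foveation coordinate transform w = log(z + a) of Eq. (1.1).

    x and y are pixel offsets from the fixation point (so the fovea sits
    at the origin of the z-plane); a = 1.0 is an illustrative choice.
    """
    z = x + 1j * y
    w = np.log(z + a)
    return w.real, w.imag

# Equal steps in the image plane shrink with eccentricity after the logmap:
xs = np.array([0.0, 10.0, 100.0, 110.0])
u, _ = logmap(xs, np.zeros_like(xs))
print(u[1] - u[0] > u[3] - u[2])  # True: foveal spacing is expanded
```

This compression of peripheral distances is what yields uniform sampling density in the transformed coordinate system.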
The foveated retinal sampling geometry can be used in different ways. The
first method is to apply the foveation coordinate transform directly to a uniform
resolution image, thus the underlying image space is mapped onto the new coor-
dinate system as exemplified by Figure 1.4. In the transform domain, the image
is treated as a uniform resolution image, and regular uniform-resolution image
processing techniques, such as linear and non-linear filtering and compression, are
applied. Finally, the inverse coordinate transform is employed to obtain a “foveatedly” processed image. The difficulty with this method is that image pixels originally located on an integer grid are moved to non-integer positions, making them difficult to index. Interpolation and resampling procedures must be applied
in both the transform and the inverse transform domains. These procedures not
only significantly complicate the system, but may also cause further distortions.
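As a toy instance of this first method, the sketch below quantizes coordinates in the logmap domain and resamples by nearest neighbor, so that peripheral pixels collapse onto shared source positions while foveal pixels map almost to themselves. The constants a and step, and the use of nearest-neighbor resampling, are illustrative choices rather than any published scheme.

```python
import numpy as np

def logmap_foveate(img, fix_x, fix_y, a=4.5, step=0.05):
    """Foveate an image by quantizing coordinates in the logmap domain.

    Each pixel position z (relative to the fixation point) is mapped to
    w = log(z + a), w is rounded to a grid of spacing `step`, and the
    inverse map exp(w) - a selects the nearest source pixel.  Near the
    fixation point the round trip is almost the identity; far from it,
    many output pixels share one source, mimicking resolution falloff.
    """
    h, w_ = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w_]
    z = (xs - fix_x) + 1j * (ys - fix_y)
    wq = np.round(np.log(z + a) / step) * step   # quantize in logmap domain
    zz = np.exp(wq) - a                          # inverse transform
    sx = np.clip(np.rint(zz.real) + fix_x, 0, w_ - 1).astype(int)
    sy = np.clip(np.rint(zz.imag) + fix_y, 0, h - 1).astype(int)
    return img[sy, sx]

img = np.arange(64 * 64, dtype=float).reshape(64, 64)
out = logmap_foveate(img, fix_x=32, fix_y=32)
```

Even this crude sketch exhibits the distortions discussed above: the rounding to integer source positions is exactly the interpolation/resampling step that complicates real systems.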
The second approach is the superpixel method [13, 16, 6, 15, 14], in which
local image pixel groups are averaged and mapped into superpixels, whose sizes
are determined by the retinal sampling density. Figure 1.5 shows a sophisticated
superpixel look-up table given in [13], which attempts to adhere to the logmap structure. However, the number and variety of superpixel shapes make the scheme inconvenient to manipulate. In [16], a more practical superpixel method is used, where
all the superpixels have rectangular shapes. In [14], a multistage superpixel ap-