
HAL Id: hal-01059986
https://hal.archives-ouvertes.fr/hal-01059986
Submitted on 15 Sep 2014
Saliency Detection for Stereoscopic Images
Yuming Fang, Junle Wang, Manish Narwaria, Patrick Le Callet, Weisi Lin
To cite this version:
Yuming Fang, Junle Wang, Manish Narwaria, Patrick Le Callet, Weisi Lin. Saliency Detection for
Stereoscopic Images. IEEE Transactions on Image Processing, Institute of Electrical and Electronics
Engineers, 2014, 23 (6), pp. 2625-2636. doi:10.1109/TIP.2014.2305100. hal-01059986

Saliency Detection for Stereoscopic Images
Yuming Fang, Member, IEEE, Junle Wang, Manish Narwaria, Patrick Le Callet, Member, IEEE,
and Weisi Lin, Senior Member, IEEE
Abstract: Many saliency detection models for 2D images have
been proposed for various multimedia processing applications
during the past decades. Currently, the emerging applications
of stereoscopic display require new saliency detection models
for salient region extraction. Different from saliency detection
for 2D images, the depth feature has to be taken into account
in saliency detection for stereoscopic images. In this paper, we
propose a novel stereoscopic saliency detection framework based
on the feature contrast of color, luminance, texture, and depth.
Four types of features, namely color, luminance, texture, and
depth, are extracted from discrete cosine transform coefficients
for feature contrast calculation. A Gaussian model of the spatial
distance between image patches is adopted for consideration
of local and global contrast calculation. Then, a new fusion
method is designed to combine the feature maps to obtain the
final saliency map for stereoscopic images. In addition, we adopt
the center bias factor and human visual acuity, the important
characteristics of the human visual system, to enhance the final
saliency map for stereoscopic images. Experimental results on
eye tracking databases show the superior performance of the
proposed model over other existing methods.
Index Terms: Stereoscopic image, 3D image, stereoscopic
saliency detection, visual attention, human visual acuity.
I. INTRODUCTION
Visual attention is an important characteristic of the Human Visual System (HVS) for visual information processing. Faced with a large amount of visual information, visual attention selectively processes the important parts and filters out the others to reduce the complexity of scene analysis. This important visual information is also termed salient regions or Regions of Interest (ROIs) in natural images. There
are two different approaches to the visual attention mechanism: bottom-up and top-down. The bottom-up approach, which is data-driven and task-independent, is a perceptual process of automatic salient region selection in natural scenes [1]–[8], while the top-down approach is task-dependent cognitive processing affected by the task being performed, the feature distribution of targets, etc. [9]–[11].
Manuscript received June 17, 2013; revised November 14, 2013 and
January 7, 2014; accepted January 26, 2014. Date of publication February 6,
2014; date of current version May 9, 2014. The associate editor coordi-
nating the review of this manuscript and approving it for publication was
Prof. Damon M. Chandler.
Y. Fang is with the School of Information Technology, Jiangxi Uni-
versity of Finance and Economics, Nanchang 330032, China (e-mail:
fa0001ng@e.ntu.edu.sg).
J. Wang, M. Narwaria, and P. Le Callet are with LUNAM Université,
Université de Nantes, Nantes Cedex 3 44306, France (e-mail:
wang.junle@gmail.com; mani0018@e.ntu.edu.sg; patrick.lecallet@
univ-nantes.fr).
W. Lin is with the School of Computer Engineering, Nanyang Technological
University, Singapore 639798 (e-mail: wslin@ntu.edu.sg).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2014.2305100
Over the past decades, many studies have tried to pro-
pose computational models of visual attention for var-
ious multimedia processing applications, such as visual
retargeting [5], visual quality assessment [9], [13], visual
coding [14], etc. In these applications, the salient regions extracted by saliency detection models are processed specifically, since they attract much more human attention than other regions. Currently, many bottom-
up saliency detection models have been proposed for
2D images/videos [1]–[8].
Today, with the development of stereoscopic display, there
are various emerging applications for 3D multimedia such
as 3D video coding [31], 3D visual quality assessment
[32], [33], 3D rendering [20], etc. In the study [33], the
authors introduced the conflicts met by the HVS while watching 3D-TV, how these conflicts might be limited, and how visual comfort might be improved by visual attention models.
The study also described some other visual attention based
3D multimedia applications, which exist in different stages
of a typical 3D-TV delivery chain, such as 3D video cap-
ture, 2D to 3D conversion, reframing and depth adapta-
tion, etc. Chamaret et al. adopted ROIs for 3D rendering
in the study [20]. Overall, the emerging demand for visual attention based applications for 3D multimedia increases the need for computational saliency detection models for 3D multimedia content.
Compared with various saliency detection models proposed
for 2D images, only a few studies on 3D saliency detection exist currently [18]–[27]. Different from saliency detection for 2D images, the depth factor has to be considered in saliency detection for 3D images. To achieve depth
perception, binocular depth cues (such as binocular disparity)
are introduced and merged together with others (such as
monocular disparity) in an adaptive way based on the viewing
space conditions. However, this change of depth perception
also largely influences the human viewing behavior [39].
Therefore, how to estimate the saliency from depth cues and
how to combine the saliency from depth with those from other
2D low-level features are two important factors in designing
3D saliency detection models.
In this paper, we propose a novel saliency detection model
for 3D images based on feature contrast from color, luminance,
texture, and depth. The features of color, luminance, texture
and depth are extracted from DCT (Discrete Cosine Trans-
form) coefficients of image patches. It is well accepted that the
DCT is a superior representation for energy compaction and
most of the signal information is concentrated on a few low-
frequency components [34]. Due to its energy compactness
property, the DCT has been widely used in various signal

processing applications in the past decades. Our previous study
has also demonstrated that DCT coefficients can be adopted
for effective feature representation in saliency detection [5].
Therefore, we use DCT coefficients for feature extraction for
image patches in this study.
In essence, the input stereoscopic image and depth map are
firstly divided into small image patches. Color, luminance and
texture features are extracted based on DCT coefficients of
each image patch from the original image, while depth feature
is extracted based on DCT coefficients of each image patch in
the depth map. Feature contrast is calculated based on center-
surround feature difference, weighted by a Gaussian model of
spatial distances between image patches for the consideration
of local and global contrast. A new fusion method is designed
to combine the feature maps to obtain the final saliency map
for 3D images. Additionally, inspired by the viewing influence
from centre bias and the property of human visual acuity in
the HVS, we propose to incorporate the centre bias factor
and human visual acuity into the proposed model to enhance
the saliency map. The Centre-Bias Map (CBM) calculated
based on centre bias factor and a statistical model of human
visual sensitivity in [38] are adopted to enhance the saliency
map for obtaining the final saliency map of 3D images.
Existing 3D saliency detection models usually adopt depth
information to weight the traditional 2D saliency map [19],
[20], or simply combine the depth saliency map and the traditional 2D saliency map [21], [23] to obtain the saliency map of 3D images. Different from these existing methods,
the proposed model adopts the low-level features of color,
luminance, texture and depth for saliency calculation in a
whole framework and designs a novel fusion method to obtain
the saliency map from feature maps. Experimental results on
eye-tracking databases demonstrate the superior performance
of the proposed model over other existing methods.
The remainder of this paper is organized as follows.
Section II introduces the related work in the literature.
In Section III, the proposed model is described in detail.
Section IV provides the experimental results on eye tracking
databases. The final section concludes the paper.
II. RELATED WORK
As introduced in the previous section, many computa-
tional models of visual attention have been proposed for
various 2D multimedia processing applications. Itti et al.
proposed one of the earliest computational saliency detec-
tion models based on the neuronal architecture of the pri-
mates’ early visual system [1]. In that study, the saliency
map is calculated by feature contrast from color, intensity
and orientation. Later, Harel et al. extended Itti’s model
by using a more accurate measure of dissimilarity [2].
In that study, the graph-based theory is used to mea-
sure saliency from feature contrast. Bruce et al. designed
a saliency detection algorithm based on information max-
imization [3]. The basic theory for saliency detection is
Shannon’s self-information measure [3]. Le Meur et al.
proposed a computational model of visual attention based
on characteristics of the HVS including contrast sensitivity
functions, perceptual decomposition, visual masking, and
center-surround interactions [12].
Hou et al. proposed a saliency detection method by the con-
cept of Spectral Residual [4]. The saliency map is computed
by log spectra representation of images from Fourier Trans-
form. Based on Hou’s model, Guo et al. designed a saliency
detection algorithm based on phase spectrum, in which the
saliency map is calculated by Inverse Fourier Transform on
a constant amplitude spectrum and the original phase spec-
trum [14]. Yan et al. introduced a saliency detection algorithm
based on sparse coding [8]. Recently, some saliency detection
models have been proposed based on patch-based contrast and obtain promising performance for salient region extraction [5]–[7].
Goferman et al. introduced a context-aware saliency detection
model based on feature contrast from color and intensity in
image patches [7]. A saliency detection model in compressed
domain is designed by Fang et al. for the application of image
retargeting [5].
Besides 2D saliency detection models, several studies have
explored the saliency detection for 3D multimedia content.
In [18], Bruce et al. proposed a stereo attention framework by
extending an existing attention architecture to the binocular
domain. However, there is no computational model proposed
in that study [18]. Zhang et al. designed a stereoscopic visual
attention algorithm for 3D video based on multiple perceptual
stimuli [19]. Chamaret et al. built a Region of Interest (ROI)
extraction method for adaptive 3D rendering [20]. Both stud-
ies [19] and [20] adopt the depth map to weight the 2D saliency map to calculate the final saliency map for 3D images. Another
class of 3D saliency detection models is built by incorporating a depth saliency map into traditional 2D saliency detection methods. In [21], Ouerhani et al. extended a 2D saliency
detection model to 3D saliency detection by taking depth cues
into account. Potapova introduced a 3D saliency detection
model for robotics tasks by incorporating the top-down cues
into the bottom-up saliency detection [22]. Lang et al. con-
ducted eye tracking experiments over 2D and 3D images for
depth saliency analysis and proposed 3D saliency detection
models by extending previous 2D saliency detection mod-
els [26]. Niu et al. explored the saliency analysis for stereo-
scopic images by extending a 2D image saliency detection
model [25]. Ciptadi et al. used the features of color and depth
to design a 3D saliency detection model for the application
of image segmentation [27]. Recently, Wang et al. proposed
a computational model of visual attention for 3D images by
extending the traditional 2D saliency detection methods. In
the study [23], the authors provided a public database with ground-truth eye-tracking data.
From the above description, the key issue for a 3D saliency detection model is how to exploit depth cues besides the traditional 2D low-level features such as color, intensity, orientation, etc. Previous studies in neuroscience indicate that the depth feature draws human attention to salient regions, just as other low-level features such as color, intensity, and motion do [15]–[17]. Therefore,
an accurate 3D saliency detection model should take depth
contrast into account as well as contrast from other common
2D low-level features. Accordingly, we propose a saliency

Fig. 1. The framework of the proposed model.
detection framework based on the feature contrast from low-
level features of color, luminance, texture and depth. A new
fusion method is designed to combine the feature maps for the
saliency estimation. Furthermore, the centre bias factor and the
human visual acuity are adopted to enhance the saliency map
for 3D images. The proposed 3D saliency detection model
can obtain promising performance for saliency estimation for
3D images, as shown in the experiment section.
III. THE PROPOSED MODEL
The framework of the proposed model is depicted in Fig. 1.
Firstly, the color, luminance, texture, and depth features are
extracted from the input stereoscopic image. Based on these
features, the feature contrast is calculated for the feature map
calculation. A fusion method is designed to combine the
feature maps into the saliency map. Additionally, we use the
centre bias factor and a model of human visual acuity to
enhance the saliency map based on the characteristics of the
HVS. We will describe each step in detail in the following
subsections.
A. Feature Extraction
In this study, the input image is divided into small image
patches and then the DCT coefficients are adopted to represent
the energy for each image patch. Our experimental results show that the proposed model achieves promising performance with patch sizes subtending a visual angle of [0.14, 0.21] degrees. In this paper, we use a patch size of 8 × 8 (a visual angle within the range of [0.14, 0.21] degrees) for the saliency calculation. This patch size is also the same as the DCT block size in JPEG compressed images. The
input RGB image is converted to YCbCr color space due to its
perceptual property. In YCbCr color space, the Y component
represents the luminance information, while Cb and Cr are
two color-opponent components. For the DCT coefficients,
DC coefficients represent the average energy over all pixels in
the image patch, while AC coefficients represent the detailed
frequency properties of the image patch. Thus, we use the
DC coefficient of Y component to represent the luminance
feature for the image patch as $L = Y_{DC}$ ($Y_{DC}$ is the DC coefficient of the Y component), while the DC coefficients of the Cb and Cr components are adopted to represent the color features as $C_1 = Cb_{DC}$ and $C_2 = Cr_{DC}$ ($Cb_{DC}$ and $Cr_{DC}$ are the DC coefficients from the Cb and Cr components, respectively).
Since the Cr and Cb components mainly include the color
information and little texture information is included in these
two channels, we use AC coefficients from only Y component
to represent the texture feature of the image patch. In a DCT block, most of the energy is contained in the first several low-frequency coefficients in the upper-left corner of the block. As there is little energy in the high-frequency coefficients in the lower-right corner of the block, we use only the first few AC coefficients to represent the texture feature of image patches. The existing study in [35] demonstrates that the first 9 low-frequency AC coefficients in zig-zag scanning order can represent most of the energy of the detailed frequency information in one 8 × 8 image patch. Based on the study [35], we use the first 9 low-frequency AC coefficients to represent the texture feature for each image patch as $T = \{Y_{AC1}, Y_{AC2}, \ldots, Y_{AC9}\}$.
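To make this step concrete, the following is a minimal illustrative sketch (not the authors' code) of the patch-level feature extraction: luminance $L$, color $C_1$, $C_2$ and texture $T$ from the 8 × 8 block DCT of a YCbCr image. The use of OpenCV, the helper name, and the zig-zag index list are assumptions made for illustration.

import numpy as np
import cv2

# zig-zag positions of the first 9 AC coefficients in an 8x8 DCT block
ZIGZAG_AC9 = [(0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2), (2, 1), (3, 0)]

def extract_2d_features(bgr_image, patch=8):
    ycc = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    Y, Cr, Cb = cv2.split(ycc)                   # contiguous single-channel planes
    H = Y.shape[0] - Y.shape[0] % patch          # crop to a multiple of the patch size
    W = Y.shape[1] - Y.shape[1] % patch
    L, C1, C2, T = [], [], [], []
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            dY  = cv2.dct(Y[y:y + patch, x:x + patch])
            dCb = cv2.dct(Cb[y:y + patch, x:x + patch])
            dCr = cv2.dct(Cr[y:y + patch, x:x + patch])
            L.append(dY[0, 0])                                   # luminance: DC of Y
            C1.append(dCb[0, 0])                                 # color: DC of Cb
            C2.append(dCr[0, 0])                                 # color: DC of Cr
            T.append([dY[i, j] for i, j in ZIGZAG_AC9])          # texture: first 9 ACs of Y
    return np.array(L), np.array(C1), np.array(C2), np.array(T)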
For the depth feature, we assume that a depth map provides
the information of the perceived depth for the scene. In a
stereoscopic display system, depth information is usually
represented by a disparity map which shows the parallax of
each pixel between the left-view and the right-view images.
The disparity is usually measured in units of pixels for display
systems. In this study, the depth map M of perceived depth
information is computed based on the disparity as [23]:
$$M = \frac{V}{1 + \frac{d \cdot H}{P \cdot W}} \qquad (1)$$
where V represents the viewing distance of the observer;
d denotes the interocular distance; P is the disparity between
pixels; W and H represent the width (in cm) and horizontal
resolution of the display screen, respectively. We set the
parameters based on the experimental studies in [23].
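As a rough illustration, Eq. (1) can be evaluated per pixel on a disparity map. The sketch below is an assumption-laden example: the function name and the default parameter values are illustrative placeholders, not the settings used in [23].

import numpy as np

def perceived_depth(P, V=80.0, d=6.3, W=93.0, H=1920):
    """Eq. (1): perceived depth M from pixel disparity P.
    V: viewing distance (cm), d: interocular distance (cm),
    W: screen width (cm), H: horizontal resolution (pixels).
    The numeric defaults are illustrative, not the values from [23]."""
    P = np.asarray(P, dtype=np.float64)
    P = np.where(P == 0, 1e-6, P)      # avoid a zero denominator at zero disparity
    return V / (1.0 + (d * H) / (P * W))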
Similar to the feature extraction for color and luminance, we adopt the DC coefficients of patches in the depth map calculated in Eq. (1) as the depth feature $D = M_{DC}$ ($M_{DC}$ represents the DC coefficient of the image patch in depth map $M$).
As described above, we can extract five features of color, luminance, texture and depth $(L, C_1, C_2, T, D)$ for the input
stereoscopic image. We will introduce how to calculate the
feature map based on these extracted features in the next
subsection.

B. Feature Map Calculation
As we have explained before, salient regions in visual scenes
pop out due to their feature contrast from their surrounding
regions. Thus, a direct method to extract salient regions in
visual scenes is to calculate the feature contrast between image
patches and their surrounding patches in visual scenes. In this
study, we estimate the saliency value of each image patch
based on the feature contrast between this image patch and
all the other patches in the image. Here, we use a Gaussian
model of spatial distance between image patches to weight the
feature contrast for saliency calculation. The saliency value $F_i^k$ of image patch $i$ from feature $k$ can be calculated as:
$$F_i^k = \sum_{j \neq i} \frac{1}{\sigma\sqrt{2\pi}} e^{-l_{ij}^2/(2\sigma^2)} \, U_{ij}^k \qquad (2)$$
where $k$ represents the feature and $k \in \{L, C_1, C_2, T, D\}$; $l_{ij}$ denotes the spatial distance between image patches $i$ and $j$; $U_{ij}^k$ represents the feature difference between image patches $i$ and $j$ from feature $k$; $\sigma$ is the parameter of
the Gaussian model and it determines the degree of local
and global contrast for the saliency estimation. σ is set as
5 based on the experiments of the previous work [5]. For any
image patch i, its saliency value is calculated based on the
center-surround differences between this patch and all other
patches in the image. The weighting for the center-surround
differences is determined by the spatial distances (within the
Gaussian model) between image patches. The differences from
nearer image patches will contribute more to the saliency value
of patch i than those from farther image patches. Thus, we
consider both local and global contrast from different features
in the proposed saliency detection model.
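A compact way to realize Eq. (2) is to precompute the pairwise patch distances and feature differences. The sketch below is a hypothetical NumPy helper (not part of the paper) showing the Gaussian-weighted center-surround summation with $\sigma = 5$.

import numpy as np

def feature_map(U, positions, sigma=5.0):
    """Eq. (2). U: (N, N) pairwise feature differences U_ij for one feature;
    positions: (N, 2) patch coordinates (in patch units).
    Returns the N saliency values F_i for that feature."""
    diff = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)                       # l_ij
    weights = np.exp(-dists ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))
    np.fill_diagonal(weights, 0.0)                              # exclude j == i
    return (weights * U).sum(axis=1)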
The feature difference $U_{ij}^k$ between image patches $i$ and $j$ is computed differently for each feature $k$ because of the different
feature representation method. Since the color, luminance and
depth features are represented by one DC coefficient for each
image patch, the feature contrast from these features (lumi-
nance, color and depth) between two image patches i and j can
be calculated as the difference between two DC coefficients
of two corresponding image patches as follows.
$$U_{ij}^m = \frac{|B_i^m - B_j^m|}{B_i^m + B_j^m} \qquad (3)$$
where $B^m$ represents the feature and $B^m \in \{L, C_1, C_2, D\}$;
the denominator is used to normalize the feature contrast.
Since texture feature is represented as 9 low-frequency
AC coefficients, we calculate the feature contrast from texture
by the L2 norm. The feature contrast $U_{ij}$ from the texture feature between two image patches $i$ and $j$ can be computed as follows:
$$U_{ij} = \frac{\sum_t (B_i^t - B_j^t)^2}{\sum_t (B_i^t + B_j^t)} \qquad (4)$$
where $t$ indexes the AC coefficients and $t \in \{1, 2, \ldots, 9\}$; $B^t$ represents the texture feature component; the denominator is adopted
to normalize the feature contrast.
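For completeness, the two difference measures of Eqs. (3) and (4) can be written in vectorized form as below; this is an illustrative sketch, and the small epsilon guard against zero denominators is an implementation assumption, not part of the paper.

import numpy as np

def dc_feature_diff(dc, eps=1e-8):
    """Eq. (3). dc: (N,) DC-based feature (L, C1, C2 or D) of all patches."""
    num = np.abs(dc[:, None] - dc[None, :])
    den = dc[:, None] + dc[None, :]
    return num / np.maximum(np.abs(den), eps)

def texture_feature_diff(ac, eps=1e-8):
    """Eq. (4). ac: (N, 9) first 9 AC coefficients (texture feature T) of all patches."""
    num = ((ac[:, None, :] - ac[None, :, :]) ** 2).sum(axis=-1)
    den = (ac[:, None, :] + ac[None, :, :]).sum(axis=-1)
    return num / np.maximum(np.abs(den), eps)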
C. Saliency Estimation from Feature Map Fusion
After calculating feature maps indicated in Eq. (2), we fuse
these feature maps from color, luminance, texture and depth
to compute the final saliency map. It is well accepted that
different visual dimensions in natural scenes are competing
with each other during the combination for the final saliency
map [40], [41]. Existing studies have shown that a stimulus
from several saliency features is generally more conspicuous
than that from only one single feature [1], [41]. The differ-
ent visual features interact and contribute simultaneously to
the saliency of visual scenes. Currently, existing studies of
3D saliency detection (e.g. [23]) use simple linear combination
to fuse the feature maps to obtain the final saliency map. The
weighting of the linear combination is set as constant values
and is the same for all images. To address the drawbacks from
ad-hoc weighting of linear combination for different feature
maps, we propose a new fusion method to assign adaptive
weighting for the fusion of feature maps in this study.
Generally, the salient regions in a good saliency map should
be small and compact, since the HVS always focuses on some
specific interesting regions in images. Thus, a good feature
map should detect small and compact regions in the image.
During the fusion of different feature maps, we can assign
more weighting for those feature maps with small and compact
salient regions and less weighting for others with more spread
salient regions. Here, we define the measure of compactness
by the spatial variance of feature maps. The spatial variance
$\upsilon_k$ of feature map $F_k$ can be computed as follows:
$$\upsilon_k = \frac{\sum_{(i,j)} \left[ (i - E_{i,k})^2 + (j - E_{j,k})^2 \right] \cdot F_k(i,j)}{\sum_{(i,j)} F_k(i,j)} \qquad (5)$$
where $(i, j)$ is the spatial location in the feature map; $k$ represents the feature channel and $k \in \{L, C_1, C_2, T, D\}$; $(E_{i,k}, E_{j,k})$ is the average spatial location weighted by the feature response, which is calculated as:
$$E_{i,k} = \frac{\sum_{(i,j)} i \cdot F_k(i,j)}{\sum_{(i,j)} F_k(i,j)} \qquad (6)$$
$$E_{j,k} = \frac{\sum_{(i,j)} j \cdot F_k(i,j)}{\sum_{(i,j)} F_k(i,j)} \qquad (7)$$
We use the normalized $\upsilon_k$ values to represent the compactness property of the feature maps. With larger spatial variance values, the feature map is supposed to be less compact. We calculate the compactness $\beta_k$ of the feature map $F_k$ as follows:
$$\beta_k = 1 / e^{\upsilon_k} \qquad (8)$$
where $k$ represents the feature channel and $k \in \{L, C_1, C_2, T, D\}$.
Based on compactness property of feature maps calculated
in Eq. (8), we fuse the feature maps for the saliency map as
follows.
$$S_f = \sum_k \beta_k \cdot F_k + \sum_{p \neq q} \beta_p \cdot \beta_q \cdot F_p \cdot F_q \qquad (9)$$
The first term in Eq. (9) represents the linear combination of feature maps weighted by their corresponding compactness properties, while the second term is adopted to strengthen locations that are salient in several feature maps simultaneously.
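To make the fusion step concrete, the sketch below implements Eqs. (5)-(9) on 2D feature maps. It is a hedged illustration rather than the authors' implementation: normalizing the spatial variances by their maximum before the exponential in Eq. (8) is an assumption, since the normalization is not spelled out here.

import numpy as np

def spatial_variance(F):
    """Eqs. (5)-(7). F: a 2D feature map."""
    ii, jj = np.indices(F.shape)
    total = F.sum() + 1e-8
    Ei = (ii * F).sum() / total                                   # Eq. (6)
    Ej = (jj * F).sum() / total                                   # Eq. (7)
    return (((ii - Ei) ** 2 + (jj - Ej) ** 2) * F).sum() / total  # Eq. (5)

def fuse_feature_maps(maps):
    """Eq. (9). maps: dict {'L': F_L, 'C1': ..., 'C2': ..., 'T': ..., 'D': ...}."""
    v = {k: spatial_variance(F) for k, F in maps.items()}
    vmax = max(v.values()) + 1e-8
    beta = {k: 1.0 / np.exp(val / vmax) for k, val in v.items()}  # Eq. (8), normalized variance
    S = sum(beta[k] * F for k, F in maps.items())                 # linear term
    for p in maps:                                                # pairwise interaction term
        for q in maps:
            if p != q:
                S = S + beta[p] * beta[q] * maps[p] * maps[q]
    return S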
