Journal ArticleDOI

Saliency Detection for Stereoscopic Images

01 Jun 2014, IEEE Transactions on Image Processing (IEEE Trans Image Process), Vol. 23, Iss. 6, pp. 2625-2636
TL;DR: A new stereoscopic saliency detection framework based on the feature contrast of color, luminance, texture, and depth is proposed, showing superior performance over existing models in saliency estimation for 3D images.
Abstract: Many saliency detection models for 2D images have been proposed for various multimedia processing applications during the past decades. Currently, the emerging applications of stereoscopic display require new saliency detection models for salient region extraction. Different from saliency detection for 2D images, the depth feature has to be taken into account in saliency detection for stereoscopic images. In this paper, we propose a novel stereoscopic saliency detection framework based on the feature contrast of color, luminance, texture, and depth. Four types of features, namely color, luminance, texture, and depth, are extracted from discrete cosine transform coefficients for feature contrast calculation. A Gaussian model of the spatial distance between image patches is adopted for consideration of local and global contrast calculation. Then, a new fusion method is designed to combine the feature maps to obtain the final saliency map for stereoscopic images. In addition, we adopt the center bias factor and human visual acuity, the important characteristics of the human visual system, to enhance the final saliency map for stereoscopic images. Experimental results on eye tracking databases show the superior performance of the proposed model over other existing methods.

Summary (4 min read)

Saliency Detection for Stereoscopic Images

  • In these applications, the salient regions extracted by saliency detection models are processed specifically since they attract much more human attention than other regions.
  • To achieve the depth perception, binocular depth cues (such as binocular disparity) are introduced and merged together with others (such as monocular disparity) in an adaptive way based on the viewing space conditions.
  • The features of color, luminance, texture and depth are extracted from DCT (Discrete Cosine Transform) coefficients of image patches.
  • Existing 3D saliency detection models usually adopt depth information to weight the traditional 2D saliency map [19] , [20] , or combine the depth saliency map and the traditional 2D saliency map simply [21] , [23] to obtain the saliency map of 3D images.

III. THE PROPOSED MODEL

  • Firstly, the color, luminance, texture, and depth features are extracted from the input stereoscopic image.
  • Based on these features, the feature contrast is calculated for the feature map calculation.
  • A fusion method is designed to combine the feature maps into the saliency map.
  • Additionally, the authors use the centre bias factor and a model of human visual acuity to enhance the saliency map based on the characteristics of the HVS.
  • The authors will describe each step in detail in the following subsections.

A. Feature Extraction

  • The input image is divided into small image patches and then the DCT coefficients are adopted to represent the energy for each image patch.
  • The input RGB image is converted to YCbCr color space due to its perceptual property.
  • As there is little energy in the high-frequency coefficients in the bottom-right corner of the DCT block, the authors use only the first several AC coefficients to represent the texture feature of image patches.
  • The depth map M of perceived depth information is computed from the disparity as M = V / (1 + d·H / (P·W)) [23], where V represents the viewing distance of the observer; d denotes the interocular distance; P is the disparity between pixels; and W and H represent the width (in cm) and horizontal resolution of the display screen, respectively.
  • The authors will introduce how to calculate the feature map based on these extracted features in the next subsection.

B. Feature Map Calculation

  • As the authors have explained before, salient regions in visual scenes pop out due to their feature contrast from their surrounding regions.
  • The authors estimate the saliency value of each image patch based on the feature contrast between this image patch and all the other patches in the image.
  • For any image patch i , its saliency value is calculated based on the center-surround differences between this patch and all other patches in the image.
  • The weighting for the center-surround differences is determined by the spatial distances (within the Gaussian model) between image patches.
  • Since the color, luminance and depth features are each represented by one DC coefficient per image patch, the feature contrast between two image patches i and j for these features is calculated as the normalized difference between the corresponding DC coefficients.

C. Saliency Estimation from Feature Map Fusion

  • After calculating the feature maps as indicated in Eq. (2), the authors fuse these feature maps from color, luminance, texture and depth to compute the final saliency map.
  • It is well accepted that different visual dimensions in natural scenes are competing with each other during the combination for the final saliency map [40] , [41] .
  • During the fusion of different feature maps, the authors can assign more weighting for those feature maps with small and compact salient regions and less weighting for others with more spread salient regions.
  • Here, the authors define the measure of compactness by the spatial variance of feature maps.
  • Experimental results in the next section show that the proposed fusion method can obtain promising performance.

D. Saliency Enhancement

  • Eye tracking experiments in existing studies [43], [44] have shown that human fixations are biased towards the screen center, which is called the centre bias.
  • The density of the cone photoreceptor cells becomes lower with larger retinal eccentricity.
  • The visual acuity decreases with the increased eccentricity from the fixation point [36] , [38] .
  • The enhancement by the centre bias factor increases the saliency values of central image regions, while the enhancement by human visual acuity decreases the saliency values of non-salient regions, so the saliency map becomes visually cleaner (a rough sketch of this step is given below).
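A rough illustration of this enhancement step (this is not the authors' exact formulation: the Gaussian width of the centre-bias map, the acuity fall-off, and the use of the saliency maximum as the fixation point are assumptions made for the sketch):

```python
import numpy as np

def centre_bias_map(h, w, sigma_ratio=0.25):
    """Isotropic Gaussian centred on the image; sigma_ratio is an assumed parameter."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    sigma = sigma_ratio * min(h, w)
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

def acuity_weight(saliency, fall_off=2.0):
    """Down-weight pixels far from the most salient location, mimicking the
    decrease of visual acuity with retinal eccentricity (fall_off is assumed)."""
    h, w = saliency.shape
    fy, fx = np.unravel_index(np.argmax(saliency), saliency.shape)
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.sqrt((ys - fy) ** 2 + (xs - fx) ** 2) / np.sqrt(h ** 2 + w ** 2)
    return np.exp(-fall_off * dist)

def enhance(saliency):
    """Boost central regions, suppress regions far from the fixation, then rescale."""
    h, w = saliency.shape
    enhanced = saliency * centre_bias_map(h, w) * acuity_weight(saliency)
    return enhanced / (enhanced.max() + 1e-12)
```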

IV. EXPERIMENT EVALUATION

  • The authors conduct the experiments to demonstrate the performance of the proposed 3D saliency detection model.
  • The authors first present the evaluation methodology and quantitative evaluation metrics.
  • Following this, the performance comparison between different feature maps is given in subsection IV-B.
  • In Subsection IV-C, the authors compare the proposed method with other existing ones.

A. Evaluation Methodology

  • In the experiment, the authors adopt the eye tracking database [29] proposed in the study [23] to evaluate the performance of the proposed model.
  • This database includes 18 stereoscopic images of various types, such as outdoor scenes, indoor scenes, scenes containing objects, scenes without any specific object, etc.
  • DOF is normally associated with free vision in the real applications, where objects actually exist at different distances from observers.
  • The data was collected by SMI RED 500 remote eye-tracker and a chin-rest was used to stabilize the observer's head.
  • The PLCC (Pearson Linear Correlation Coefficient), KLD (Kullback-Leibler Divergence), and AUC (Area Under the Receiver Operating Characteristic Curve) are used to evaluate the quantitative performance of the proposed stereoscopic saliency detection model; a small sketch of these measures is given below.
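A minimal sketch of these three measures; the KLD direction and the exact AUC variant used by the authors are not restated in the summary, so the choices below are assumptions:

```python
import numpy as np

def plcc(saliency, fixation_density):
    """Pearson linear correlation between the predicted saliency map and the fixation density map."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-12)
    f = (fixation_density - fixation_density.mean()) / (fixation_density.std() + 1e-12)
    return float(np.mean(s * f))

def kld(saliency, fixation_density, eps=1e-12):
    """KL divergence, treating both maps as probability distributions (direction assumed)."""
    p = fixation_density / (fixation_density.sum() + eps)
    q = saliency / (saliency.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def auc(saliency, fixation_mask):
    """Rank-based AUC: saliency at fixated pixels (positives) vs. all other pixels.
    fixation_mask is a boolean map of actual gaze locations; ties are not handled."""
    pos = saliency[fixation_mask]
    neg = saliency[~fixation_mask]
    ranks = np.argsort(np.argsort(np.concatenate([pos, neg]))) + 1
    u = ranks[: len(pos)].sum() - len(pos) * (len(pos) + 1) / 2
    return float(u / (len(pos) * len(neg)))
```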

B. Experiment 1: Comparison Between Different Feature Channels

  • The authors compare the performance of different feature maps from color, luminance, texture and depth.
  • Table I provides the quantitative comparison results for these feature maps.
  • Its KLD value is also higher than those from other features.
  • From this figure, the authors can see that the feature maps from color, luminance and depth are better than those from texture feature.
  • The overall saliency map by combining feature maps can obtain the best saliency estimation, as shown in Fig. 5(g) .

C. Experiment 2: Comparison Between the Proposed Method and Other Existing Ones

  • The authors compare the proposed 3D saliency detection model with other existing ones in [23] .
  • For the saliency results from the fusion model combining the 2D saliency model in [3] and the depth saliency in [23], some background regions are detected as salient regions, as shown in the saliency maps in Fig. 6(d).
  • Similarly, the 3D model combining the proposed 2D model with the DSM in [23] achieves better performance than the combinations of other 2D models with the DSM in [23].
  • That database includes 600 stereoscopic images of indoor and outdoor scenes.
  • Please note that the AUC and CC values of other existing models are from the original paper [26] .

V. CONCLUSION

  • The authors propose a new stereoscopic saliency detection model for 3D images.
  • The features of color, luminance, texture and depth are extracted from DCT coefficients to represent the energy for small image patches.
  • The saliency is estimated based on the energy contrast weighted by a Gaussian model of spatial distances between image patches for the consideration of both local and global contrast.
  • A new fusion method is designed to combine the feature maps for the final saliency map.
  • Experimental results show the promising performance of the proposed saliency detection model for stereoscopic images based on the recent eye tracking databases.


HAL Id: hal-01059986
https://hal.archives-ouvertes.fr/hal-01059986
Submitted on 15 Sep 2014
Saliency Detection for Stereoscopic Images
Yuming Fang, Junle Wang, Manish Narwaria, Patrick Le Callet, Weisi Lin
To cite this version:
Yuming Fang, Junle Wang, Manish Narwaria, Patrick Le Callet, Weisi Lin. Saliency Detection for
Stereoscopic Images. IEEE Transactions on Image Processing, Institute of Electrical and Electronics
Engineers, 2014, 23 (6), pp. 2625-2636. DOI: 10.1109/TIP.2014.2305100. HAL: hal-01059986.

Saliency Detection for Stereoscopic Images
Yuming Fang, Member, IEEE, Junle Wang, Manish Narwaria, Patrick Le Callet, Member, IEEE,
and Weisi Lin, Senior Member, IEEE
Abstract: Many saliency detection models for 2D images have
been proposed for various multimedia processing applications
during the past decades. Currently, the emerging applications
of stereoscopic display require new saliency detection models
for salient region extraction. Different from saliency detection
for 2D images, the depth feature has to be taken into account
in saliency detection for stereoscopic images. In this paper, we
propose a novel stereoscopic saliency detection framework based
on the feature contrast of color, luminance, texture, and depth.
Four types of features, namely color, luminance, texture, and
depth, are extracted from discrete cosine transform coefficients
for feature contrast calculation. A Gaussian model of the spatial
distance between image patches is adopted for consideration
of local and global contrast calculation. Then, a new fusion
method is designed to combine the feature maps to obtain the
final saliency map for stereoscopic images. In addition, we adopt
the center bias factor and human visual acuity, the important
characteristics of the human visual system, to enhance the final
saliency map for stereoscopic images. Experimental results on
eye tracking databases show the superior performance of the
proposed model over other existing methods.
Index Terms: Stereoscopic image, 3D image, stereoscopic
saliency detection, visual attention, human visual acuity.
I. INTRODUCTION
Visual attention is an important characteristic in the
Human Visual System (HVS) for visual information
processing. With a large amount of visual information, visual attention selectively processes the important parts by filtering out the rest to reduce the complexity of scene analysis. This important visual information is also termed salient regions or Regions of Interest (ROIs) in natural images. There
are two different approaches in the visual attention mechanism: bottom-up and top-down. The bottom-up approach, which is data-driven and task-independent, is a perception process for automatic salient region selection in natural scenes [1]–[8], while the top-down approach is a task-dependent cognitive process affected by the performed tasks, the feature distribution of targets, etc. [9]–[11].
Manuscript received June 17, 2013; revised November 14, 2013 and
January 7, 2014; accepted January 26, 2014. Date of publication February 6,
2014; date of current version May 9, 2014. The associate editor coordi-
nating the review of this manuscript and approving it for publication was
Prof. Damon M. Chandler.
Y. Fang is with the School of Information Technology, Jiangxi Uni-
versity of Finance and Economics, Nanchang 330032, China (e-mail:
fa0001ng@e.ntu.edu.sg).
J. Wang, M. Narwaria, and P. Le Callet are with LUNAM Université,
Université de Nantes, Nantes Cedex 3 44306, France (e-mail:
wang.junle@gmail.com; mani0018@e.ntu.edu.sg; patrick.lecallet@
univ-nantes.fr).
W. Lin is with the School of Computer Engineering, Nanyang Technological
University, Singapore 639798 (e-mail: wslin@ntu.edu.sg).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2014.2305100
Over the past decades, many studies have tried to pro-
pose computational models of visual attention for var-
ious multimedia processing applications, such as visual
retargeting [5], visual quality assessment [9], [13], visual
coding [14], etc. In these applications, the salient regions
extracted from saliency detection models are processed
specifically since they attract much more human attention than other regions. Currently, many bottom-
up saliency detection models have been proposed for
2D images/videos [1]–[8].
Today, with the development of stereoscopic display, there
are various emerging applications for 3D multimedia such
as 3D video coding [31], 3D visual quality assessment
[32], [33], 3D rendering [20], etc. In the study [33], the
authors introduced the conflicts met by the HVS while watching 3D-TV, how these conflicts might be limited, and how visual comfort might be improved by a visual attention model.
The study also described some other visual attention based
3D multimedia applications, which exist in different stages
of a typical 3D-TV delivery chain, such as 3D video cap-
ture, 2D to 3D conversion, reframing and depth adapta-
tion, etc. Chamaret et al. adopted ROIs for 3D rendering
in the study [20]. Overall, the emerging demand of visual
attention based applications for 3D multimedia increases the
requirement of computational saliency detection models for
3D multimedia content.
Compared with various saliency detection models proposed
for 2D images, only a few studies exploiting the 3D saliency
detection exist currently [18]–[27]. Different from saliency
detection for 2D images, the depth factor has to be consid-
ered in saliency detection for 3D images. To achieve the depth
perception, binocular depth cues (such as binocular disparity)
are introduced and merged together with others (such as
monocular disparity) in an adaptive way based on the viewing
space conditions. However, this change of depth perception
also largely influences the human viewing behavior [39].
Therefore, how to estimate the saliency from depth cues and
how to combine the saliency from depth with those from other
2D low-level features are two important factors in designing
3D saliency detection models.
In this paper, we propose a novel saliency detection model
for 3D images based on feature contrast from color, luminance,
texture, and depth. The features of color, luminance, texture
and depth are extracted from DCT (Discrete Cosine Trans-
form) coefficients of image patches. It is well accepted that the
DCT is a superior representation for energy compaction and
most of the signal information is concentrated on a few low-
frequency components [34]. Due to its energy compactness
property, the DCT has been widely used in various signal

processing applications in the past decades. Our previous study
has also demonstrated that DCT coefficients can be adopted
for effective feature representation in saliency detection [5].
Therefore, we use DCT coefficients for feature extraction for
image patches in this study.
In essence, the input stereoscopic image and depth map are
firstly divided into small image patches. Color, luminance and
texture features are extracted based on DCT coefficients of
each image patch from the original image, while depth feature
is extracted based on DCT coefficients of each image patch in
the depth map. Feature contrast is calculated based on center-
surround feature difference, weighted by a Gaussian model of
spatial distances between image patches for the consideration
of local and global contrast. A new fusion method is designed
to combine the feature maps to obtain the final saliency map
for 3D images. Additionally, inspired by the viewing influence
from centre bias and the property of human visual acuity in
the HVS, we propose to incorporate the centre bias factor
and human visual acuity into the proposed model to enhance
the saliency map. The Centre-Bias Map (CBM) calculated
based on centre bias factor and a statistical model of human
visual sensitivity in [38] are adopted to enhance the saliency
map for obtaining the final saliency map of 3D images.
Existing 3D saliency detection models usually adopt depth
information to weight the traditional 2D saliency map [19],
[20], or combine the depth saliency map and the traditional
2D saliency map simply [21], [23] to obtain the saliency
map of 3D images. Different from these existing methods,
the proposed model adopts the low-level features of color,
luminance, texture and depth for saliency calculation in a
whole framework and designs a novel fusion method to obtain
the saliency map from feature maps. Experimental results on
eye-tracking databases demonstrate the superior performance
of the proposed model over other existing methods.
The remainder of this paper is organized as follows.
Section II introduces the related work in the literature.
In Section III, the proposed model is described in detail.
Section IV provides the experimental results on eye tracking
databases. The final section concludes the paper.
II. RELATED WORK
As introduced in the previous section, many computa-
tional models of visual attention have been proposed for
various 2D multimedia processing applications. Itti et al.
proposed one of the earliest computational saliency detec-
tion models based on the neuronal architecture of the pri-
mates’ early visual system [1]. In that study, the saliency
map is calculated by feature contrast from color, intensity
and orientation. Later, Harel et al. extended Itti’s model
by using a more accurate measure of dissimilarity [2].
In that study, the graph-based theory is used to mea-
sure saliency from feature contrast. Bruce et al. designed
a saliency detection algorithm based on information max-
imization [3]. The basic theory for saliency detection is
Shannon’s self-information measure [3]. Le Meur et al.
proposed a computational model of visual attention based
on characteristics of the HVS including contrast sensitivity
functions, perceptual decomposition, visual masking, and
center-surround interactions [12].
Hou et al. proposed a saliency detection method by the con-
cept of Spectral Residual [4]. The saliency map is computed
by log spectra representation of images from Fourier Trans-
form. Based on Hou’s model, Guo et al. designed a saliency
detection algorithm based on phase spectrum, in which the
saliency map is calculated by Inverse Fourier Transform on
a constant amplitude spectrum and the original phase spec-
trum [14]. Yan et al. introduced a saliency detection algorithm
based on sparse coding [8]. Recently, some saliency detection
models have been proposed by patch-based contrast and obtain
promising performance for salient region extraction [5]–[7].
Goferman et al. introduced a context-aware saliency detection
model based on feature contrast from color and intensity in
image patches [7]. A saliency detection model in compressed
domain is designed by Fang et al. for the application of image
retargeting [5].
Besides 2D saliency detection models, several studies have
explored the saliency detection for 3D multimedia content.
In [18], Bruce et al. proposed a stereo attention framework by
extending an existing attention architecture to the binocular
domain. However, there is no computational model proposed
in that study [18]. Zhang et al. designed a stereoscopic visual
attention algorithm for 3D video based on multiple perceptual
stimuli [19]. Chamaret et al. built a Region of Interest (ROI)
extraction method for adaptive 3D rendering [20]. Both stud-
ies [19] and [20] adopt depth map to weight the 2D saliency
map to calculate the final saliency map for 3D images. Another
method of 3D saliency detection model is built by incorporat-
ing depth saliency map into the traditional 2D saliency detec-
tion methods. In [21], Ouerhani et al. extended a 2D saliency
detection model to 3D saliency detection by taking depth cues
into account. Potapova introduced a 3D saliency detection
model for robotics tasks by incorporating the top-down cues
into the bottom-up saliency detection [22]. Lang et al. con-
ducted eye tracking experiments over 2D and 3D images for
depth saliency analysis and proposed 3D saliency detection
models by extending previous 2D saliency detection mod-
els [26]. Niu et al. explored the saliency analysis for stereo-
scopic images by extending a 2D image saliency detection
model [25]. Ciptadi et al. used the features of color and depth
to design a 3D saliency detection model for the application
of image segmentation [27]. Recently, Wang et al. proposed
a computational model of visual attention for 3D images by
extending the traditional 2D saliency detection methods. In
the study [23], the authors provided a public database with
ground-truth of eye-tracking data.
From the above description, the key issue in 3D saliency detection is how to adopt the depth cues besides the traditional
2D low-level features such as color, intensity, orientation,
etc. Previous studies from neuroscience indicate that the
depth feature draws human attention to salient regions, as do other low-level features such as color, intensity, motion, etc. [15]–[17]. Therefore,
an accurate 3D saliency detection model should take depth
contrast into account as well as contrast from other common
2D low-level features. Accordingly, we propose a saliency

Fig. 1. The framework of the proposed model.
detection framework based on the feature contrast from low-
level features of color, luminance, texture and depth. A new
fusion method is designed to combine the feature maps for the
saliency estimation. Furthermore, the centre bias factor and the
human visual acuity are adopted to enhance the saliency map
for 3D images. The proposed 3D saliency detection model
can obtain promising performance for saliency estimation for
3D images, as shown in the experiment section.
III. THE PROPOSED MODEL
The framework of the proposed model is depicted as Fig. 1.
Firstly, the color, luminance, texture, and depth features are
extracted from the input stereoscopic image. Based on these
features, the feature contrast is calculated for the feature map
calculation. A fusion method is designed to combine the
feature maps into the saliency map. Additionally, we use the
centre bias factor and a model of human visual acuity to
enhance the saliency map based on the characteristics of the
HVS. We will describe each step in detail in the following
subsections.
A. Feature Extraction
In this study, the input image is divided into small image
patches and then the DCT coefficients are adopted to represent
the energy for each image patch. Our experimental results
show that the proposed model with the patch size within
the visual angle of [0.14, 0.21] (degrees) can get promising
performance. In this paper, we use the patch size of 8 × 8
(the visual angle within the range of [0.14, 0.21] degrees) for
the saliency calculation. The image patch size used is also the same as the DCT block size in JPEG compressed images. The
input RGB image is converted to YCbCr color space due to its
perceptual property. In YCbCr color space, the Y component
represents the luminance information, while Cb and Cr are
two color-opponent components. For the DCT coefficients,
DC coefficients represent the average energy over all pixels in
the image patch, while AC coefficients represent the detailed
frequency properties of the image patch. Thus, we use the
DC coefficient of Y component to represent the luminance
feature for the image patch as $L = Y_{DC}$ ($Y_{DC}$ is the DC coefficient of the Y component), while the DC coefficients of the Cb and Cr components are adopted to represent the color features as $C_1 = Cb_{DC}$ and $C_2 = Cr_{DC}$ ($Cb_{DC}$ and $Cr_{DC}$ are the DC coefficients from the Cb and Cr components, respectively).
Since the Cr and Cb components mainly include the color
information and little texture information is included in these
two channels, we use AC coefficients from only Y component
to represent the texture feature of the image patch. In DCT
block, most of the energy is included in the first several low-
frequency coefficients in the left-upper corner of the DCT
block. As there is little energy with the high-frequency coeffi-
cients in the bottom-right corner of the DCT block, we use only the first several AC coefficients to represent the texture feature of
image patches. The existing study in [35] demonstrates that the
first 9 low-frequency AC coefficients in zig-zag scanning can
represent most energy for the detailed frequency information
in one 8 ×8 image patch. Based on the study [35], we use the
first 9 low-frequency AC coefficients to represent the texture
feature for each image patch as T ={Y
AC1
, Y
AC2
,...,Y
AC9
}.
For the depth feature, we assume that a depth map provides
the information of the perceived depth for the scene. In a
stereoscopic display system, depth information is usually
represented by a disparity map which shows the parallax of
each pixel between the left-view and the right-view images.
The disparity is usually measured in unit of pixels for display
systems. In this study, the depth map M of perceived depth
information is computed based on the disparity as [23]:
$$M = \frac{V}{1 + \frac{d \cdot H}{P \cdot W}} \qquad (1)$$
where V represents the viewing distance of the observer;
d denotes the interocular distance; P is the disparity between
pixels; W and H represent the width (in cm) and horizontal
resolution of the display screen, respectively. We set the
parameters based on the experimental studies in [23].
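A small sketch of Eq. (1); the default parameter values below are illustrative assumptions rather than the settings taken from [23]:

```python
import numpy as np

def perceived_depth(P, V=90.0, d=6.3, W=50.0, H=1920):
    """Perceived depth map from a disparity map P (in pixels), following Eq. (1).
    V: viewing distance (cm), d: interocular distance (cm),
    W: screen width (cm), H: horizontal resolution (pixels).
    The default values are placeholders, not the authors' settings."""
    P = np.asarray(P, dtype=np.float64)
    eps = 1e-6
    safe_P = np.where(np.abs(P) < eps, eps, P)   # guard against zero disparity
    return V / (1.0 + (d * H) / (safe_P * W))
```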
Similar to the feature extraction for color and luminance, we adopt the DC coefficients of patches in the depth map calculated in Eq. (1) as $D = M_{DC}$ ($M_{DC}$ represents the DC coefficient of the image patch in the depth map $M$).
As described above, we can extract five features of color,
luminance, texture and depth ($L$, $C_1$, $C_2$, $T$, $D$) for the input
stereoscopic image. We will introduce how to calculate the
feature map based on these extracted features in the next
subsection.

B. Feature Map Calculation
As we have explained before, salient regions in visual scenes
pop out due to their feature contrast from their surrounding
regions. Thus, a direct method to extract salient regions in
visual scenes is to calculate the feature contrast between image
patches and their surrounding patches in visual scenes. In this
study, we estimate the saliency value of each image patch
based on the feature contrast between this image path and
all the other patches in the image. Here, we use a Gaussian
model of spatial distance between image patches to weight the
feature contrast for saliency calculation. The saliency value $F_i^k$ of image patch $i$ from feature $k$ can be calculated as:
$$F_i^k = \sum_{j \neq i} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-l_{ij}^2/(2\sigma^2)}\, U_{ij}^k \qquad (2)$$
where $k$ represents the feature and $k \in \{L, C_1, C_2, T, D\}$; $l_{ij}$ denotes the spatial distance between image patches $i$ and $j$; $U_{ij}^k$ represents the feature difference between image patches $i$ and $j$ from feature $k$; $\sigma$ is the parameter of
the Gaussian model and it determines the degree of local
and global contrast for the saliency estimation. σ is set as
5 based on the experiments of the previous work [5]. For any
image patch i, its saliency value is calculated based on the
center-surround differences between this patch and all other
patches in the image. The weighting for the center-surround
differences is determined by the spatial distances (within the
Gaussian model) between image patches. The differences from
nearer image patches will contribute more to the saliency value
of patch i than those from farther image patches. Thus, we
consider both local and global contrast from different features
in the proposed saliency detection model.
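A compact sketch of Eq. (2) for one feature channel, assuming the patch centre positions and the pairwise feature differences U have already been computed:

```python
import numpy as np

def gaussian_weighted_contrast(positions, U, sigma=5.0):
    """Eq. (2): saliency of each patch as the sum of its feature differences U[i, j]
    to all other patches, weighted by a Gaussian of the spatial distance l_ij.
    positions: (N, 2) array of patch centres; U: (N, N) pairwise feature differences."""
    diff = positions[:, None, :].astype(float) - positions[None, :, :].astype(float)
    l = np.sqrt((diff ** 2).sum(axis=-1))                              # l_ij
    w = np.exp(-(l ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    np.fill_diagonal(w, 0.0)                                           # exclude j == i
    return (w * U).sum(axis=1)                                         # F_i^k for feature k
```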
The feature difference $U_{ij}^k$ between image patches $i$ and $j$ is computed differently for different features $k$ due to their different representation methods. Since the color, luminance and
depth features are represented by one DC coefficient for each
image patch, the feature contrast from these features (lumi-
nance, color and depth) between two image patches i and j can
be calculated as the difference between two DC coefficients
of two corresponding image patches as follows.
$$U_{ij}^m = \frac{|B_i^m - B_j^m|}{B_i^m + B_j^m} \qquad (3)$$
where $B^m$ represents the feature and $B^m \in \{L, C_1, C_2, D\}$;
the denominator is used to normalize the feature contrast.
Since texture feature is represented as 9 low-frequency
AC coefficients, we calculate the feature contrast from texture
by the L2 norm. The feature contrast $U_{ij}$ from texture feature
between two image patches i and j can be computed as
follows.
$$U_{ij} = \frac{\sqrt{\sum_t (B_i^t - B_j^t)^2}}{\sum_t (B_i^t + B_j^t)} \qquad (4)$$
where t represents the AC coefficients and t ∈{1, 2, ..., 9};
$B$ represents the texture feature; the denominator is adopted
to normalize the feature contrast.
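The two difference measures of Eqs. (3) and (4) can be sketched as follows; the small epsilon guarding against zero denominators is added here for numerical safety and is not part of the paper:

```python
import numpy as np

def dc_feature_difference(Bi, Bj, eps=1e-12):
    """Eq. (3): normalised absolute difference of the DC-based features
    (luminance, colour or depth) of patches i and j."""
    return abs(Bi - Bj) / (Bi + Bj + eps)

def texture_feature_difference(Ti, Tj, eps=1e-12):
    """Eq. (4): L2 norm of the AC-coefficient difference, normalised by the
    sum of the coefficients (Ti, Tj are the 9-element texture vectors)."""
    Ti, Tj = np.asarray(Ti, dtype=float), np.asarray(Tj, dtype=float)
    return np.sqrt(((Ti - Tj) ** 2).sum()) / ((Ti + Tj).sum() + eps)
```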
C. Saliency Estimation from Feature Map Fusion
After calculating feature maps indicated in Eq. (2), we fuse
these feature maps from color, luminance, texture and depth
to compute the final saliency map. It is well accepted that
different visual dimensions in natural scenes are competing
with each other during the combination for the final saliency
map [40], [41]. Existing studies have shown that a stimulus
from several saliency features is generally more conspicuous
than that from only one single feature [1], [41]. The differ-
ent visual features interact and contribute simultaneously to
the saliency of visual scenes. Currently, existing studies of
3D saliency detection (e.g. [23]) use simple linear combination
to fuse the feature maps to obtain the final saliency map. The
weighting of the linear combination is set as constant values
and is the same for all images. To address the drawbacks from
ad-hoc weighting of linear combination for different feature
maps, we propose a new fusion method to assign adaptive
weighting for the fusion of feature maps in this study.
Generally, the salient regions in a good saliency map should
be small and compact, since the HVS always focus on some
specific interesting regions in images. Thus, a good feature
map should detect small and compact regions in the image.
During the fusion of different feature maps, we can assign
more weighting for those feature maps with small and compact
salient regions and less weighting for others with more spread
salient regions. Here, we define the measure of compactness
by the spatial variance of feature maps. The spatial variance
$\upsilon_k$ of feature map $F_k$ can be computed as follows.
$$\upsilon_k = \frac{\sum_{(i,j)} \left[(i - E_{i,k})^2 + (j - E_{j,k})^2\right] \cdot F_k(i,j)}{\sum_{(i,j)} F_k(i,j)} \qquad (5)$$
where (i, j) is the spatial location in the feature map;
$k$ represents the feature channel and $k \in \{L, C_1, C_2, T, D\}$; $(E_{i,k}, E_{j,k})$ is the average spatial location weighted by the feature response, which is calculated as:
$$E_{i,k} = \frac{\sum_{(i,j)} i \cdot F_k(i,j)}{\sum_{(i,j)} F_k(i,j)} \qquad (6)$$
$$E_{j,k} = \frac{\sum_{(i,j)} j \cdot F_k(i,j)}{\sum_{(i,j)} F_k(i,j)} \qquad (7)$$
We use the normalized $\upsilon_k$ values to represent the compactness property for feature maps. With larger spatial variance values, the feature map is supposed to be less compact. We calculate the compactness $\beta_k$ of the feature map $F_k$ as follows.
$$\beta_k = 1/e^{\upsilon_k} \qquad (8)$$
where $k$ represents the feature channel and $k \in \{L, C_1, C_2, T, D\}$.
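A sketch of Eqs. (5)-(8); the paper states that the spatial variances are normalized but does not restate how, so dividing by the maximum variance below is an assumption:

```python
import numpy as np

def spatial_variance(F, eps=1e-12):
    """Eq. (5): spatial variance of a feature map F around its
    response-weighted centre (Eqs. (6) and (7))."""
    h, w = F.shape
    ii, jj = np.mgrid[0:h, 0:w]
    total = F.sum() + eps
    Ei = (ii * F).sum() / total                                    # Eq. (6)
    Ej = (jj * F).sum() / total                                    # Eq. (7)
    return (((ii - Ei) ** 2 + (jj - Ej) ** 2) * F).sum() / total   # Eq. (5)

def compactness_weights(feature_maps, eps=1e-12):
    """Eq. (8): beta_k = 1 / exp(v_k), computed on normalised spatial variances."""
    v = np.array([spatial_variance(F, eps) for F in feature_maps])
    v = v / (v.max() + eps)            # normalisation across feature maps (assumed form)
    return 1.0 / np.exp(v)
```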
Based on compactness property of feature maps calculated
in Eq. (8), we fuse the feature maps for the saliency map as
follows.
$$S_f = \sum_k \beta_k \cdot F_k + \sum_{p \neq q} \beta_p \cdot \beta_q \cdot F_p \cdot F_q \qquad (9)$$
The first term in Eq. (9) represents the linear combination
of feature maps weighted by corresponding compactness prop-
erties of feature maps; while the second term is adopted to

Citations
More filters
Proceedings ArticleDOI
15 Jun 2019
TL;DR: Contrast prior is utilized, which used to be a dominant cue in none deep learning based SOD approaches, into CNNs-based architecture to enhance the depth information and is integrated with RGB features for SOD, using a novel fluid pyramid integration.
Abstract: The large availability of depth sensors provides valuable complementary information for salient object detection (SOD) in RGBD images. However, due to the inherent difference between RGB and depth information, extracting features from the depth channel using ImageNet pre-trained backbone models and fusing them with RGB features directly are sub-optimal. In this paper, we utilize contrast prior, which used to be a dominant cue in none deep learning based SOD approaches, into CNNs-based architecture to enhance the depth information. The enhanced depth cues are further integrated with RGB features for SOD, using a novel fluid pyramid integration, which can make better use of multi-scale cross-modal features. Comprehensive experiments on 5 challenging benchmark datasets demonstrate the superiority of the architecture CPFP over 9 state-of-the-art alternative methods.

385 citations

Journal ArticleDOI
TL;DR: A novel framework based on convolutional neural networks (CNNs), which transfers the structure of the RGB-based deep neural network to be applicable for depth view and fuses the deep representations of both views automatically to obtain the final saliency map is proposed.
Abstract: Salient object detection from RGB-D images aims to utilize both the depth view and RGB view to automatically localize objects of human interest in the scene. Although a few earlier efforts have been devoted to the study of this paper in recent years, two major challenges still remain: 1) how to leverage the depth view effectively to model the depth-induced saliency and 2) how to implement an optimal combination of the RGB view and depth view, which can make full use of complementary information among them. To address these two challenges, this paper proposes a novel framework based on convolutional neural networks (CNNs), which transfers the structure of the RGB-based deep neural network to be applicable for depth view and fuses the deep representations of both views automatically to obtain the final saliency map. In the proposed framework, the first challenge is modeled as a cross-view transfer problem and addressed by using the task-relevant initialization and adding deep supervision in hidden layer. The second challenge is addressed by a multiview CNN fusion model through a combination layer connecting the representation layers of RGB view and depth view. Comprehensive experiments on four benchmark datasets demonstrate the significant and consistent improvements of the proposed approach over other state-of-the-art methods.

364 citations


Cites methods from "Saliency Detection for Stereoscopic..."

  • ...Linear weighted fusion including summation (SUM) [8], [25] and multiplication (MUL) [7] are two of the simplest and most widely used fusion methods....


Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper reviewed different types of saliency detection algorithms, summarize the important issues of the existing methods, and discuss the existent problems and future works, and the experimental analysis and discussion are conducted to provide a holistic overview of different saliency detectors.
Abstract: The visual saliency detection model simulates the human visual system to perceive the scene and has been widely used in many vision tasks. With the development of acquisition technology, more comprehensive information, such as depth cue, inter-image correspondence, or temporal relationship, is available to extend image saliency detection to RGBD saliency detection, co-saliency detection, or video saliency detection. The RGBD saliency detection model focuses on extracting the salient regions from RGBD images by combining the depth information. The co-saliency detection model introduces the inter-image correspondence constraint to discover the common salient object in an image group. The goal of the video saliency detection model is to locate the motion-related salient object in video sequences, which considers the motion cue and spatiotemporal constraint jointly. In this paper, we review different types of saliency detection algorithms, summarize the important issues of the existing methods, and discuss the existent problems and future works. Moreover, the evaluation datasets and quantitative measurements are briefly introduced, and the experimental analysis and discussion are conducted to provide a holistic overview of different saliency detection methods.

328 citations

Journal ArticleDOI
TL;DR: A novel multi-scale multi-path fusion network with cross-modal interactions (MMCI), in which the traditional two-stream fusion architecture with single fusion path is advanced by diversifying the fusion path to a global reasoning one and another local capturing one and meanwhile introducing cross- modal interactions in multiple layers.

310 citations

Proceedings ArticleDOI
01 Jun 2018
TL;DR: A novel complementarity-aware fusion (CA-Fuse) module when adopting the Convolutional Neural Network (CNN) and the proposed RGB-D fusion network disambiguates both cross-modal and cross-level fusion processes and enables more sufficient fusion results.
Abstract: How to incorporate cross-modal complementarity sufficiently is the cornerstone question for RGB-D salient object detection. Previous works mainly address this issue by simply concatenating multi-modal features or combining unimodal predictions. In this paper, we answer this question from two perspectives: (1) We argue that if the complementary part can be modelled more explicitly, the cross-modal complement is likely to be better captured. To this end, we design a novel complementarity-aware fusion (CA-Fuse) module when adopting the Convolutional Neural Network (CNN). By introducing cross-modal residual functions and complementarity-aware supervisions in each CA-Fuse module, the problem of learning complementary information from the paired modality is explicitly posed as asymptotically approximating the residual function. (2) Exploring the complement across all the levels. By cascading the CA-Fuse module and adding level-wise supervision from deep to shallow densely, the cross-level complement can be selected and combined progressively. The proposed RGB-D fusion network disambiguates both cross-modal and cross-level fusion processes and enables more sufficient fusion results. The experiments on public datasets show the effectiveness of the proposed CA-Fuse module and the RGB-D salient object detection network.

303 citations


Cites methods from "Saliency Detection for Stereoscopic..."

  • ...Result fusion methods include summation [35, 42], multiplication [31] and designed rules [33]....


References
More filters
Journal ArticleDOI
TL;DR: A new hypothesis about the role of focused attention is proposed, which offers a new set of criteria for distinguishing separable from integral features and a new rationale for predicting which tasks will show attention limits and which will not.

11,452 citations


"Saliency Detection for Stereoscopic..." refers background or methods in this paper

  • ...According to the Feature Integration Theory (FIT) [13], the early selective attention causes some image regions to be salient due to their different features (such as color, intensity, texture, depth, etc....


  • ...According to FIT [13], the salient regions in visual scenes will pop out due to their feature contrast from their surrounding regions....


  • ...Based on the FIT, many bottom-up saliency detection models have been proposed for 2D images/videos recently [1]-[6]....


  • ...According to the Feature Integration Theory (FIT) [13], the early selective attention causes some image regions to be salient due to their different features (such as color, intensity, texture, depth, etc.) from their surrounding regions....


Journal ArticleDOI
TL;DR: In this article, a visual attention system inspired by the behavior and the neuronal architecture of the early primate visual system is presented, where multiscale image features are combined into a single topographical saliency map.
Abstract: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented. Multiscale image features are combined into a single topographical saliency map. A dynamical neural network then selects attended locations in order of decreasing saliency. The system breaks down the complex problem of scene understanding by rapidly selecting, in a computationally efficient manner, conspicuous locations to be analyzed in detail.

10,525 citations

01 Jan 1998
TL;DR: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented, which breaks down the complex problem of scene understanding by rapidly selecting conspicuous locations to be analyzed in detail.

8,566 citations


"Saliency Detection for Stereoscopic..." refers background or methods in this paper

  • ...proposed one of the earliest computational saliency detection model based on the neuronal architecture of the primates’ early visual system [1]....


  • ...Bottom-up approach, which is data-driven and task-independent, is a perception process for automatic salient region selection for natural scenes [1]-[7], while topdown approach is a task-dependent cognitive processing affected by the performed tasks, feature distribution of targets, and so on [8]-[9]....


  • ...Based on the FIT, many bottom-up saliency detection models have been proposed for 2D images/videos recently [1]-[6]....


  • ...In Table 1, Model 1 in [23] represents the fusion method from 2D saliency detection model in [1] and depth model in [23]; Table 1....


  • ...In contrast, the existing models in [23] which incorporate the 2D saliency methods [1, 2, 3] are designed for only bottom-up mechanis-...


Proceedings ArticleDOI
20 Jun 2011
TL;DR: This work proposes a regional contrast based saliency extraction algorithm, which simultaneously evaluates global contrast differences and spatial coherence, and consistently outperformed existing saliency detection methods.
Abstract: Automatic estimation of salient object regions across images, without any prior assumption or knowledge of the contents of the corresponding scenes, enhances many computer vision and computer graphics applications. We introduce a regional contrast based salient object detection algorithm, which simultaneously evaluates global contrast differences and spatial weighted coherence scores. The proposed algorithm is simple, efficient, naturally multi-scale, and produces full-resolution, high-quality saliency maps. These saliency maps are further used to initialize a novel iterative version of GrabCut, namely SaliencyCut, for high quality unsupervised salient object segmentation. We extensively evaluated our algorithm using traditional salient object detection datasets, as well as a more challenging Internet image dataset. Our experimental results demonstrate that our algorithm consistently outperforms 15 existing salient object detection and segmentation methods, yielding higher precision and better recall rates. We also show that our algorithm can be used to efficiently extract salient object masks from Internet images, enabling effective sketch-based image retrieval (SBIR) via simple shape comparisons. Despite such noisy internet images, where the saliency regions are ambiguous, our saliency guided image retrieval achieves a superior retrieval rate compared with state-of-the-art SBIR methods, and additionally provides important target object region information.

3,653 citations


"Saliency Detection for Stereoscopic..." refers background in this paper

  • ...explored the saliency analysis for stereoscopic images by extending a 2D image saliency detection model [25]....


Proceedings Article
04 Dec 2006
TL;DR: A new bottom-up visual saliency model, Graph-Based Visual Saliency (GBVS), is proposed, which powerfully predicts human fixations on 749 variations of 108 natural images, achieving 98% of the ROC area of a human-based control, whereas the classical algorithms of Itti & Koch achieve only 84%.
Abstract: A new bottom-up visual saliency model, Graph-Based Visual Saliency (GBVS), is proposed. It consists of two steps: first forming activation maps on certain feature channels, and then normalizing them in a way which highlights conspicuity and admits combination with other maps. The model is simple, and biologically plausible insofar as it is naturally parallelized. This model powerfully predicts human fixations on 749 variations of 108 natural images, achieving 98% of the ROC area of a human-based control, whereas the classical algorithms of Itti & Koch ([2], [3], [4]) achieve only 84%.

3,475 citations


"Saliency Detection for Stereoscopic..." refers background or methods in this paper

  • ...extended Itti’s model by using a more accurate measure of dissimilarity [2]....


  • ...GBVS [2], AIM [3], FT [4], ICL [47], LSK [48],...


Frequently Asked Questions (10)
Q1. What contributions have the authors mentioned in the paper "Saliency detection for stereoscopic images"?

In this paper, the authors propose a novel stereoscopic saliency detection framework based on the feature contrast of color, luminance, texture, and depth. 

With the enhancement operation by the centre bias factor, the saliency values of center regions in images would increase, while with the enhancement operation by human visual acuity, the saliency values of non-salient regions in natural scenes would decrease and the saliency map would get visually better. 

Since texture feature is represented as 9 low-frequency AC coefficients, the authors calculate the feature contrast from texture by the L2 norm. 

Another method of 3D saliency detection model is built by incorporating depth saliency map into the traditional 2D saliency detection methods. 

Similar with feature extraction for color and luminance, the authors adopt the DC coefficients of patches in depth map calculated in Eq. (1) as D = MDC (MDC represents the DC coefficient of the image patch in depth map M). 

Based on the study [35], the authors use the first 9 low-frequency AC coefficients to represent the texture feature for each image patch as T = {YAC1, YAC2, . . . , YAC9}. 

Among these measures, PLCC and KLD are calculated directly from the comparison between the fixation density map and the predicted saliency map, while AUC is computed from the comparison between the actual gaze points and the predicted saliency map.

Compared with feature maps from these low-level features of color, luminance, texture and depth, the final saliency map calculated from the proposed fusion method can get much better performance for saliency estimation for 3D images, as shown by the PLCC, KLD and AUC values in Table I. 

The retinal eccentricity e between a salient pixel and a non-salient pixel can be computed according to its relationship with the spatial distance between image pixels.

The proposed 3D saliency detection model can obtain promising performance for saliency estimation for 3D images, as shown in the experiment section.