Journal ArticleDOI

Saliency Detection for Stereoscopic Images

01 Jun 2014, IEEE Transactions on Image Processing (IEEE Trans Image Process), Vol. 23, Iss. 6, pp. 2625-2636
TL;DR: A new stereoscopic saliency detection framework based on the feature contrast of color, luminance, texture, and depth is proposed, showing superior performance over existing models in saliency estimation for 3D images.
Abstract: Many saliency detection models for 2D images have been proposed for various multimedia processing applications during the past decades. Currently, the emerging applications of stereoscopic display require new saliency detection models for salient region extraction. Different from saliency detection for 2D images, the depth feature has to be taken into account in saliency detection for stereoscopic images. In this paper, we propose a novel stereoscopic saliency detection framework based on the feature contrast of color, luminance, texture, and depth. Four types of features, namely color, luminance, texture, and depth, are extracted from discrete cosine transform coefficients for feature contrast calculation. A Gaussian model of the spatial distance between image patches is adopted for consideration of local and global contrast calculation. Then, a new fusion method is designed to combine the feature maps to obtain the final saliency map for stereoscopic images. In addition, we adopt the center bias factor and human visual acuity, the important characteristics of the human visual system, to enhance the final saliency map for stereoscopic images. Experimental results on eye tracking databases show the superior performance of the proposed model over other existing methods.

Summary (4 min read)

Saliency Detection for Stereoscopic Images

  • In these applications, the salient regions extracted by saliency detection models are processed specifically since they attract much more human attention than other regions.
  • To achieve the depth perception, binocular depth cues (such as binocular disparity) are introduced and merged together with others (such as monocular disparity) in an adaptive way based on the viewing space conditions.
  • The features of color, luminance, texture and depth are extracted from DCT (Discrete Cosine Transform) coefficients of image patches.
  • Existing 3D saliency detection models usually adopt depth information to weight the traditional 2D saliency map [19] , [20] , or combine the depth saliency map and the traditional 2D saliency map simply [21] , [23] to obtain the saliency map of 3D images.

III. THE PROPOSED MODEL

  • Firstly, the color, luminance, texture, and depth features are extracted from the input stereoscopic image.
  • Based on these features, the feature contrast is calculated for the feature map calculation.
  • A fusion method is designed to combine the feature maps into the saliency map.
  • Additionally, the authors use the centre bias factor and a model of human visual acuity to enhance the saliency map based on the characteristics of the HVS.
  • The authors will describe each step in detail in the following subsections.

A. Feature Extraction

  • The input image is divided into small image patches and then the DCT coefficients are adopted to represent the energy for each image patch.
  • The input RGB image is converted to YCbCr color space due to its perceptual property.
  • As there is little energy in the high-frequency coefficients in the bottom-right corner of the DCT block, the authors use only the first several AC coefficients to represent the texture feature of image patches.
  • The depth map M of perceived depth information is computed from the disparity as M = V / (1 + d·H / (P·W)) [23], where V represents the viewing distance of the observer; d denotes the interocular distance; P is the disparity between pixels; and W and H represent the width (in cm) and horizontal resolution of the display screen, respectively.
  • The authors will introduce how to calculate the feature map based on these extracted features in the next subsection.

B. Feature Map Calculation

  • As the authors have explained before, salient regions in visual scenes pop out due to their feature contrast from their surrounding regions.
  • The authors estimate the saliency value of each image patch based on the feature contrast between this image patch and all the other patches in the image.
  • For any image patch i , its saliency value is calculated based on the center-surround differences between this patch and all other patches in the image.
  • The weighting for the center-surround differences is determined by the spatial distances (within the Gaussian model) between image patches.
  • Since the color, luminance and depth features are each represented by one DC coefficient per image patch, the feature contrast between two image patches i and j for these features is calculated as the normalized difference between the corresponding DC coefficients.

C. Saliency Estimation from Feature Map Fusion

  • After calculating the feature maps as indicated in Eq. (2), the authors fuse these feature maps from color, luminance, texture and depth to compute the final saliency map.
  • It is well accepted that different visual dimensions in natural scenes are competing with each other during the combination for the final saliency map [40] , [41] .
  • During the fusion of different feature maps, the authors can assign more weighting for those feature maps with small and compact salient regions and less weighting for others with more spread salient regions.
  • Here, the authors define the measure of compactness by the spatial variance of feature maps.
  • Experimental results in the next section show that the proposed fusion method can obtain promising performance.

D. Saliency Enhancement

  • Eye tracking experiments in existing studies [43], [44] have shown that human fixations are biased towards the screen center, which is called the centre bias.
  • The density of the cone photoreceptor cells becomes lower with larger retinal eccentricity.
  • The visual acuity decreases with the increased eccentricity from the fixation point [36] , [38] .
  • The enhancement by the centre bias factor increases the saliency values of central image regions, while the enhancement by human visual acuity decreases the saliency values of non-salient regions, so the saliency map becomes visually cleaner (a rough sketch of this step is given below).
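A rough illustration of this enhancement step (this is not the authors' exact formulation: the Gaussian width of the centre-bias map, the acuity fall-off, and the use of the saliency maximum as the fixation point are assumptions made for the sketch):

```python
import numpy as np

def centre_bias_map(h, w, sigma_ratio=0.25):
    """Isotropic Gaussian centred on the image; sigma_ratio is an assumed parameter."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    sigma = sigma_ratio * min(h, w)
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

def acuity_weight(saliency, fall_off=2.0):
    """Down-weight pixels far from the most salient location, mimicking the
    decrease of visual acuity with retinal eccentricity (fall_off is assumed)."""
    h, w = saliency.shape
    fy, fx = np.unravel_index(np.argmax(saliency), saliency.shape)
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.sqrt((ys - fy) ** 2 + (xs - fx) ** 2) / np.sqrt(h ** 2 + w ** 2)
    return np.exp(-fall_off * dist)

def enhance(saliency):
    """Boost central regions, suppress regions far from the fixation, then rescale."""
    h, w = saliency.shape
    enhanced = saliency * centre_bias_map(h, w) * acuity_weight(saliency)
    return enhanced / (enhanced.max() + 1e-12)
```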

IV. EXPERIMENT EVALUATION

  • The authors conduct the experiments to demonstrate the performance of the proposed 3D saliency detection model.
  • The authors first present the evaluation methodology and quantitative evaluation metrics.
  • Following this, the performance comparison between different feature maps is given in subsection IV-B.
  • In Subsection IV-C, the authors compare the proposed method with other existing ones.

A. Evaluation Methodology

  • In the experiment, the authors adopt the eye tracking database [29] proposed in the study [23] to evaluate the performance of the proposed model.
  • This database includes 18 stereoscopic images of various types, such as outdoor scenes, indoor scenes, scenes containing objects, scenes without any specific object, etc.
  • DOF is normally associated with free vision in the real applications, where objects actually exist at different distances from observers.
  • The data was collected by SMI RED 500 remote eye-tracker and a chin-rest was used to stabilize the observer's head.
  • The PLCC (Pearson Linear Correlation Coefficient), KLD (Kullback-Leibler Divergence), and AUC (Area Under the Receiver Operating Characteristic Curve) are used to evaluate the quantitative performance of the proposed stereoscopic saliency detection model; a small sketch of these measures is given below.
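A minimal sketch of these three measures; the KLD direction and the exact AUC variant used by the authors are not restated in the summary, so the choices below are assumptions:

```python
import numpy as np

def plcc(saliency, fixation_density):
    """Pearson linear correlation between the predicted saliency map and the fixation density map."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-12)
    f = (fixation_density - fixation_density.mean()) / (fixation_density.std() + 1e-12)
    return float(np.mean(s * f))

def kld(saliency, fixation_density, eps=1e-12):
    """KL divergence, treating both maps as probability distributions (direction assumed)."""
    p = fixation_density / (fixation_density.sum() + eps)
    q = saliency / (saliency.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def auc(saliency, fixation_mask):
    """Rank-based AUC: saliency at fixated pixels (positives) vs. all other pixels.
    fixation_mask is a boolean map of actual gaze locations; ties are not handled."""
    pos = saliency[fixation_mask]
    neg = saliency[~fixation_mask]
    ranks = np.argsort(np.argsort(np.concatenate([pos, neg]))) + 1
    u = ranks[: len(pos)].sum() - len(pos) * (len(pos) + 1) / 2
    return float(u / (len(pos) * len(neg)))
```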

B. Experiment 1: Comparison Between Different Feature Channels

  • The authors compare the performance of different feature maps from color, luminance, texture and depth.
  • Table I provides the quantitative comparison results for these feature maps.
  • Its KLD value is also higher than those from other features.
  • From this figure, the authors can see that the feature maps from color, luminance and depth are better than those from texture feature.
  • The overall saliency map by combining feature maps can obtain the best saliency estimation, as shown in Fig. 5(g) .

C. Experiment 2: Comparison Between the Proposed Method and Other Existing Ones

  • The authors compare the proposed 3D saliency detection model with other existing ones in [23] .
  • For the saliency results from the fusion model combining the 2D saliency model in [3] and the depth saliency in [23], some background regions are detected as salient regions, as shown in the saliency maps in Fig. 6(d).
  • Similarly, the 3D model combining the proposed 2D model with the DSM in [23] achieves better performance than the combinations of other 2D models with the DSM in [23].
  • That database includes 600 stereoscopic images of indoor and outdoor scenes.
  • Please note that the AUC and CC values of other existing models are from the original paper [26] .

V. CONCLUSION

  • The authors propose a new stereoscopic saliency detection model for 3D images.
  • The features of color, luminance, texture and depth are extracted from DCT coefficients to represent the energy for small image patches.
  • The saliency is estimated based on the energy contrast weighted by a Gaussian model of spatial distances between image patches for the consideration of both local and global contrast.
  • A new fusion method is designed to combine the feature maps for the final saliency map.
  • Experimental results show the promising performance of the proposed saliency detection model for stereoscopic images based on the recent eye tracking databases.


HAL Id: hal-01059986
https://hal.archives-ouvertes.fr/hal-01059986
Submitted on 15 Sep 2014
Saliency Detection for Stereoscopic Images
Yuming Fang, Junle Wang, Manish Narwaria, Patrick Le Callet, Weisi Lin
To cite this version:
Yuming Fang, Junle Wang, Manish Narwaria, Patrick Le Callet, Weisi Lin. Saliency Detection for
Stereoscopic Images. IEEE Transactions on Image Processing, Institute of Electrical and Electronics
Engineers, 2014, 23 (6), pp. 2625-2636. DOI: 10.1109/TIP.2014.2305100. HAL: hal-01059986.

Saliency Detection for Stereoscopic Images
Yuming Fang, Member, IEEE, Junle Wang, Manish Narwaria, Patrick Le Callet, Member, IEEE,
and Weisi Lin, Senior Member, IEEE
Abstract: Many saliency detection models for 2D images have
been proposed for various multimedia processing applications
during the past decades. Currently, the emerging applications
of stereoscopic display require new saliency detection models
for salient region extraction. Different from saliency detection
for 2D images, the depth feature has to be taken into account
in saliency detection for stereoscopic images. In this paper, we
propose a novel stereoscopic saliency detection framework based
on the feature contrast of color, luminance, texture, and depth.
Four types of features, namely color, luminance, texture, and
depth, are extracted from discrete cosine transform coefficients
for feature contrast calculation. A Gaussian model of the spatial
distance between image patches is adopted for consideration
of local and global contrast calculation. Then, a new fusion
method is designed to combine the feature maps to obtain the
final saliency map for stereoscopic images. In addition, we adopt
the center bias factor and human visual acuity, the important
characteristics of the human visual system, to enhance the final
saliency map for stereoscopic images. Experimental results on
eye tracking databases show the superior performance of the
proposed model over other existing methods.
Index Terms: Stereoscopic image, 3D image, stereoscopic
saliency detection, visual attention, human visual acuity.
I. INTRODUCTION
Visual attention is an important characteristic in the
Human Visual System (HVS) for visual information
processing. With a large amount of visual information, visual attention selectively processes the important parts by filtering out the rest to reduce the complexity of scene analysis. This important visual information is also termed salient regions or Regions of Interest (ROIs) in natural images. There
are two different approaches in the visual attention mechanism: bottom-up and top-down. The bottom-up approach, which is data-driven and task-independent, is a perception process for automatic salient region selection in natural scenes [1]–[8], while the top-down approach is a task-dependent cognitive process affected by the performed tasks, the feature distribution of targets, etc. [9]–[11].
Manuscript received June 17, 2013; revised November 14, 2013 and
January 7, 2014; accepted January 26, 2014. Date of publication February 6,
2014; date of current version May 9, 2014. The associate editor coordi-
nating the review of this manuscript and approving it for publication was
Prof. Damon M. Chandler.
Y. Fang is with the School of Information Technology, Jiangxi Uni-
versity of Finance and Economics, Nanchang 330032, China (e-mail:
fa0001ng@e.ntu.edu.sg).
J. Wang, M. Narwaria, and P. Le Callet are with LUNAM Université,
Université de Nantes, Nantes Cedex 3 44306, France (e-mail:
wang.junle@gmail.com; mani0018@e.ntu.edu.sg; patrick.lecallet@
univ-nantes.fr).
W. Lin is with the School of Computer Engineering, Nanyang Technological
University, Singapore 639798 (e-mail: wslin@ntu.edu.sg).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2014.2305100
Over the past decades, many studies have tried to pro-
pose computational models of visual attention for var-
ious multimedia processing applications, such as visual
retargeting [5], visual quality assessment [9], [13], visual
coding [14], etc. In these applications, the salient regions
extracted from saliency detection models are processed
specifically since they attract much more human attention than other regions. Currently, many bottom-
up saliency detection models have been proposed for
2D images/videos [1]–[8].
Today, with the development of stereoscopic display, there
are various emerging applications for 3D multimedia such
as 3D video coding [31], 3D visual quality assessment
[32], [33], 3D rendering [20], etc. In the study [33], the
authors introduced the conflicts met by the HVS while watching 3D-TV, how these conflicts might be limited, and how visual comfort might be improved by a visual attention model.
The study also described some other visual attention based
3D multimedia applications, which exist in different stages
of a typical 3D-TV delivery chain, such as 3D video cap-
ture, 2D to 3D conversion, reframing and depth adapta-
tion, etc. Chamaret et al. adopted ROIs for 3D rendering
in the study [20]. Overall, the emerging demand of visual
attention based applications for 3D multimedia increases the
requirement of computational saliency detection models for
3D multimedia content.
Compared with various saliency detection models proposed
for 2D images, only a few studies exploiting the 3D saliency
detection exist currently [18]–[27]. Different from saliency
detection for 2D images, the depth factor has to be consid-
ered in saliency detection for 3D images. To achieve the depth
perception, binocular depth cues (such as binocular disparity)
are introduced and merged together with others (such as
monocular disparity) in an adaptive way based on the viewing
space conditions. However, this change of depth perception
also largely influences the human viewing behavior [39].
Therefore, how to estimate the saliency from depth cues and
how to combine the saliency from depth with those from other
2D low-level features are two important factors in designing
3D saliency detection models.
In this paper, we propose a novel saliency detection model
for 3D images based on feature contrast from color, luminance,
texture, and depth. The features of color, luminance, texture
and depth are extracted from DCT (Discrete Cosine Trans-
form) coefficients of image patches. It is well accepted that the
DCT is a superior representation for energy compaction and
most of the signal information is concentrated on a few low-
frequency components [34]. Due to its energy compactness
property, the DCT has been widely used in various signal

processing applications in the past decades. Our previous study
has also demonstrated that DCT coefficients can be adopted
for effective feature representation in saliency detection [5].
Therefore, we use DCT coefficients for feature extraction for
image patches in this study.
In essence, the input stereoscopic image and depth map are
firstly divided into small image patches. Color, luminance and
texture features are extracted based on DCT coefficients of
each image patch from the original image, while depth feature
is extracted based on DCT coefficients of each image patch in
the depth map. Feature contrast is calculated based on center-
surround feature difference, weighted by a Gaussian model of
spatial distances between image patches for the consideration
of local and global contrast. A new fusion method is designed
to combine the feature maps to obtain the final saliency map
for 3D images. Additionally, inspired by the viewing influence
from centre bias and the property of human visual acuity in
the HVS, we propose to incorporate the centre bias factor
and human visual acuity into the proposed model to enhance
the saliency map. The Centre-Bias Map (CBM) calculated
based on centre bias factor and a statistical model of human
visual sensitivity in [38] are adopted to enhance the saliency
map for obtaining the final saliency map of 3D images.
Existing 3D saliency detection models usually adopt depth
information to weight the traditional 2D saliency map [19],
[20], or combine the depth saliency map and the traditional
2D saliency map simply [21], [23] to obtain the saliency
map of 3D images. Different from these existing methods,
the proposed model adopts the low-level features of color,
luminance, texture and depth for saliency calculation in a
whole framework and designs a novel fusion method to obtain
the saliency map from feature maps. Experimental results on
eye-tracking databases demonstrate the superior performance
of the proposed model over other existing methods.
The remainder of this paper is organized as follows.
Section II introduces the related work in the literature.
In Section III, the proposed model is described in detail.
Section IV provides the experimental results on eye tracking
databases. The final section concludes the paper.
II. RELATED WORK
As introduced in the previous section, many computa-
tional models of visual attention have been proposed for
various 2D multimedia processing applications. Itti et al.
proposed one of the earliest computational saliency detec-
tion models based on the neuronal architecture of the pri-
mates’ early visual system [1]. In that study, the saliency
map is calculated by feature contrast from color, intensity
and orientation. Later, Harel et al. extended Itti’s model
by using a more accurate measure of dissimilarity [2].
In that study, the graph-based theory is used to mea-
sure saliency from feature contrast. Bruce et al. designed
a saliency detection algorithm based on information max-
imization [3]. The basic theory for saliency detection is
Shannon’s self-information measure [3]. Le Meur et al.
proposed a computational model of visual attention based
on characteristics of the HVS including contrast sensitivity
functions, perceptual decomposition, visual masking, and
center-surround interactions [12].
Hou et al. proposed a saliency detection method by the con-
cept of Spectral Residual [4]. The saliency map is computed
by log spectra representation of images from Fourier Trans-
form. Based on Hou’s model, Guo et al. designed a saliency
detection algorithm based on phase spectrum, in which the
saliency map is calculated by Inverse Fourier Transform on
a constant amplitude spectrum and the original phase spec-
trum [14]. Yan et al. introduced a saliency detection algorithm
based on sparse coding [8]. Recently, some saliency detection
models have been proposed by patch-based contrast and obtain
promising performance for salient region extraction [5]–[7].
Goferman et al. introduced a context-aware saliency detection
model based on feature contrast from color and intensity in
image patches [7]. A saliency detection model in compressed
domain is designed by Fang et al. for the application of image
retargeting [5].
Besides 2D saliency detection models, several studies have
explored the saliency detection for 3D multimedia content.
In [18], Bruce et al. proposed a stereo attention framework by
extending an existing attention architecture to the binocular
domain. However, there is no computational model proposed
in that study [18]. Zhang et al. designed a stereoscopic visual
attention algorithm for 3D video based on multiple perceptual
stimuli [19]. Chamaret et al. built a Region of Interest (ROI)
extraction method for adaptive 3D rendering [20]. Both stud-
ies [19] and [20] adopt depth map to weight the 2D saliency
map to calculate the final saliency map for 3D images. Another
method of 3D saliency detection model is built by incorporat-
ing depth saliency map into the traditional 2D saliency detec-
tion methods. In [21], Ouerhani et al. extended a 2D saliency
detection model to 3D saliency detection by taking depth cues
into account. Potapova introduced a 3D saliency detection
model for robotics tasks by incorporating the top-down cues
into the bottom-up saliency detection [22]. Lang et al. con-
ducted eye tracking experiments over 2D and 3D images for
depth saliency analysis and proposed 3D saliency detection
models by extending previous 2D saliency detection mod-
els [26]. Niu et al. explored the saliency analysis for stereo-
scopic images by extending a 2D image saliency detection
model [25]. Ciptadi et al. used the features of color and depth
to design a 3D saliency detection model for the application
of image segmentation [27]. Recently, Wang et al. proposed
a computational model of visual attention for 3D images by
extending the traditional 2D saliency detection methods. In
the study [23], the authors provided a public database with
ground-truth of eye-tracking data.
From the above description, the key issue in 3D saliency detection is how to adopt the depth cues besides the traditional
2D low-level features such as color, intensity, orientation,
etc. Previous studies from neuroscience indicate that the
depth feature draws human attention to salient regions, as do other low-level features such as color, intensity, motion, etc. [15]–[17]. Therefore,
an accurate 3D saliency detection model should take depth
contrast into account as well as contrast from other common
2D low-level features. Accordingly, we propose a saliency

Fig. 1. The framework of the proposed model.
detection framework based on the feature contrast from low-
level features of color, luminance, texture and depth. A new
fusion method is designed to combine the feature maps for the
saliency estimation. Furthermore, the centre bias factor and the
human visual acuity are adopted to enhance the saliency map
for 3D images. The proposed 3D saliency detection model
can obtain promising performance for saliency estimation for
3D images, as shown in the experiment section.
III. THE PROPOSED MODEL
The framework of the proposed model is depicted as Fig. 1.
Firstly, the color, luminance, texture, and depth features are
extracted from the input stereoscopic image. Based on these
features, the feature contrast is calculated for the feature map
calculation. A fusion method is designed to combine the
feature maps into the saliency map. Additionally, we use the
centre bias factor and a model of human visual acuity to
enhance the saliency map based on the characteristics of the
HVS. We will describe each step in detail in the following
subsections.
A. Feature Extraction
In this study, the input image is divided into small image
patches and then the DCT coefficients are adopted to represent
the energy for each image patch. Our experimental results
show that the proposed model with the patch size within
the visual angle of [0.14, 0.21] (degrees) can get promising
performance. In this paper, we use the patch size of 8 × 8
(the visual angle within the range of [0.14, 0.21] degrees) for
the saliency calculation. The image patch size used is also the same as the DCT block size in JPEG compressed images. The
input RGB image is converted to YCbCr color space due to its
perceptual property. In YCbCr color space, the Y component
represents the luminance information, while Cb and Cr are
two color-opponent components. For the DCT coefficients,
DC coefficients represent the average energy over all pixels in
the image patch, while AC coefficients represent the detailed
frequency properties of the image patch. Thus, we use the
DC coefficient of Y component to represent the luminance
feature for the image patch as $L = Y_{DC}$ ($Y_{DC}$ is the DC coefficient of the Y component), while the DC coefficients of the Cb and Cr components are adopted to represent the color features as $C_1 = Cb_{DC}$ and $C_2 = Cr_{DC}$ ($Cb_{DC}$ and $Cr_{DC}$ are the DC coefficients from the Cb and Cr components, respectively).
Since the Cr and Cb components mainly include the color
information and little texture information is included in these
two channels, we use AC coefficients from only Y component
to represent the texture feature of the image patch. In DCT
block, most of the energy is included in the first several low-
frequency coefficients in the left-upper corner of the DCT
block. As there is little energy with the high-frequency coeffi-
cients in the bottom-right corner of the DCT block, we use only the first several AC coefficients to represent the texture feature of
image patches. The existing study in [35] demonstrates that the
first 9 low-frequency AC coefficients in zig-zag scanning can
represent most energy for the detailed frequency information
in one 8 ×8 image patch. Based on the study [35], we use the
first 9 low-frequency AC coefficients to represent the texture
feature for each image patch as T ={Y
AC1
, Y
AC2
,...,Y
AC9
}.
For the depth feature, we assume that a depth map provides
the information of the perceived depth for the scene. In a
stereoscopic display system, depth information is usually
represented by a disparity map which shows the parallax of
each pixel between the left-view and the right-view images.
The disparity is usually measured in unit of pixels for display
systems. In this study, the depth map M of perceived depth
information is computed based on the disparity as [23]:
$$M = \frac{V}{1 + \frac{d \cdot H}{P \cdot W}} \qquad (1)$$
where V represents the viewing distance of the observer;
d denotes the interocular distance; P is the disparity between
pixels; W and H represent the width (in cm) and horizontal
resolution of the display screen, respectively. We set the
parameters based on the experimental studies in [23].
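A small sketch of Eq. (1); the default parameter values below are illustrative assumptions rather than the settings taken from [23]:

```python
import numpy as np

def perceived_depth(P, V=90.0, d=6.3, W=50.0, H=1920):
    """Perceived depth map from a disparity map P (in pixels), following Eq. (1).
    V: viewing distance (cm), d: interocular distance (cm),
    W: screen width (cm), H: horizontal resolution (pixels).
    The default values are placeholders, not the authors' settings."""
    P = np.asarray(P, dtype=np.float64)
    eps = 1e-6
    safe_P = np.where(np.abs(P) < eps, eps, P)   # guard against zero disparity
    return V / (1.0 + (d * H) / (safe_P * W))
```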
Similar to the feature extraction for color and luminance, we adopt the DC coefficients of patches in the depth map calculated in Eq. (1) as $D = M_{DC}$ ($M_{DC}$ represents the DC coefficient of the image patch in the depth map $M$).
As described above, we can extract five features of color,
luminance, texture and depth ($L$, $C_1$, $C_2$, $T$, $D$) for the input
stereoscopic image. We will introduce how to calculate the
feature map based on these extracted features in the next
subsection.

B. Feature Map Calculation
As we have explained before, salient regions in visual scenes
pop out due to their feature contrast from their surrounding
regions. Thus, a direct method to extract salient regions in
visual scenes is to calculate the feature contrast between image
patches and their surrounding patches in visual scenes. In this
study, we estimate the saliency value of each image patch
based on the feature contrast between this image path and
all the other patches in the image. Here, we use a Gaussian
model of spatial distance between image patches to weight the
feature contrast for saliency calculation. The saliency value $F_i^k$ of image patch $i$ from feature $k$ can be calculated as:
$$F_i^k = \sum_{j \neq i} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-l_{ij}^2/(2\sigma^2)}\, U_{ij}^k \qquad (2)$$
where $k$ represents the feature and $k \in \{L, C_1, C_2, T, D\}$; $l_{ij}$ denotes the spatial distance between image patches $i$ and $j$; $U_{ij}^k$ represents the feature difference between image patches $i$ and $j$ from feature $k$; $\sigma$ is the parameter of
the Gaussian model and it determines the degree of local
and global contrast for the saliency estimation. σ is set as
5 based on the experiments of the previous work [5]. For any
image patch i, its saliency value is calculated based on the
center-surround differences between this patch and all other
patches in the image. The weighting for the center-surround
differences is determined by the spatial distances (within the
Gaussian model) between image patches. The differences from
nearer image patches will contribute more to the saliency value
of patch i than those from farther image patches. Thus, we
consider both local and global contrast from different features
in the proposed saliency detection model.
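A compact sketch of Eq. (2) for one feature channel, assuming the patch centre positions and the pairwise feature differences U have already been computed:

```python
import numpy as np

def gaussian_weighted_contrast(positions, U, sigma=5.0):
    """Eq. (2): saliency of each patch as the sum of its feature differences U[i, j]
    to all other patches, weighted by a Gaussian of the spatial distance l_ij.
    positions: (N, 2) array of patch centres; U: (N, N) pairwise feature differences."""
    diff = positions[:, None, :].astype(float) - positions[None, :, :].astype(float)
    l = np.sqrt((diff ** 2).sum(axis=-1))                              # l_ij
    w = np.exp(-(l ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    np.fill_diagonal(w, 0.0)                                           # exclude j == i
    return (w * U).sum(axis=1)                                         # F_i^k for feature k
```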
The feature difference $U_{ij}^k$ between image patches $i$ and $j$ is computed differently for different features $k$ due to their different representation methods. Since the color, luminance and
depth features are represented by one DC coefficient for each
image patch, the feature contrast from these features (lumi-
nance, color and depth) between two image patches i and j can
be calculated as the difference between two DC coefficients
of two corresponding image patches as follows.
$$U_{ij}^m = \frac{|B_i^m - B_j^m|}{B_i^m + B_j^m} \qquad (3)$$
where $B^m$ represents the feature and $B^m \in \{L, C_1, C_2, D\}$;
the denominator is used to normalize the feature contrast.
Since texture feature is represented as 9 low-frequency
AC coefficients, we calculate the feature contrast from texture
by the L2 norm. The feature contrast $U_{ij}$ from texture feature
between two image patches i and j can be computed as
follows.
$$U_{ij} = \frac{\sqrt{\sum_t (B_i^t - B_j^t)^2}}{\sum_t (B_i^t + B_j^t)} \qquad (4)$$
where t represents the AC coefficients and t ∈{1, 2, ..., 9};
$B$ represents the texture feature; the denominator is adopted
to normalize the feature contrast.
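The two difference measures of Eqs. (3) and (4) can be sketched as follows; the small epsilon guarding against zero denominators is added here for numerical safety and is not part of the paper:

```python
import numpy as np

def dc_feature_difference(Bi, Bj, eps=1e-12):
    """Eq. (3): normalised absolute difference of the DC-based features
    (luminance, colour or depth) of patches i and j."""
    return abs(Bi - Bj) / (Bi + Bj + eps)

def texture_feature_difference(Ti, Tj, eps=1e-12):
    """Eq. (4): L2 norm of the AC-coefficient difference, normalised by the
    sum of the coefficients (Ti, Tj are the 9-element texture vectors)."""
    Ti, Tj = np.asarray(Ti, dtype=float), np.asarray(Tj, dtype=float)
    return np.sqrt(((Ti - Tj) ** 2).sum()) / ((Ti + Tj).sum() + eps)
```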
C. Saliency Estimation from Feature Map Fusion
After calculating feature maps indicated in Eq. (2), we fuse
these feature maps from color, luminance, texture and depth
to compute the final saliency map. It is well accepted that
different visual dimensions in natural scenes are competing
with each other during the combination for the final saliency
map [40], [41]. Existing studies have shown that a stimulus
from several saliency features is generally more conspicuous
than that from only one single feature [1], [41]. The differ-
ent visual features interact and contribute simultaneously to
the saliency of visual scenes. Currently, existing studies of
3D saliency detection (e.g. [23]) use simple linear combination
to fuse the feature maps to obtain the final saliency map. The
weighting of the linear combination is set as constant values
and is the same for all images. To address the drawbacks from
ad-hoc weighting of linear combination for different feature
maps, we propose a new fusion method to assign adaptive
weighting for the fusion of feature maps in this study.
Generally, the salient regions in a good saliency map should
be small and compact, since the HVS always focus on some
specific interesting regions in images. Thus, a good feature
map should detect small and compact regions in the image.
During the fusion of different feature maps, we can assign
more weighting for those feature maps with small and compact
salient regions and less weighting for others with more spread
salient regions. Here, we define the measure of compactness
by the spatial variance of feature maps. The spatial variance
$\upsilon_k$ of feature map $F_k$ can be computed as follows.
$$\upsilon_k = \frac{\sum_{(i,j)} \left[(i - E_{i,k})^2 + (j - E_{j,k})^2\right] \cdot F_k(i,j)}{\sum_{(i,j)} F_k(i,j)} \qquad (5)$$
where (i, j) is the spatial location in the feature map;
$k$ represents the feature channel and $k \in \{L, C_1, C_2, T, D\}$; $(E_{i,k}, E_{j,k})$ is the average spatial location weighted by the feature response, which is calculated as:
$$E_{i,k} = \frac{\sum_{(i,j)} i \cdot F_k(i,j)}{\sum_{(i,j)} F_k(i,j)} \qquad (6)$$
$$E_{j,k} = \frac{\sum_{(i,j)} j \cdot F_k(i,j)}{\sum_{(i,j)} F_k(i,j)} \qquad (7)$$
We use the normalized $\upsilon_k$ values to represent the compactness property for feature maps. With larger spatial variance values, the feature map is supposed to be less compact. We calculate the compactness $\beta_k$ of the feature map $F_k$ as follows.
$$\beta_k = 1/e^{\upsilon_k} \qquad (8)$$
where $k$ represents the feature channel and $k \in \{L, C_1, C_2, T, D\}$.
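A sketch of Eqs. (5)-(8); the paper states that the spatial variances are normalized but does not restate how, so dividing by the maximum variance below is an assumption:

```python
import numpy as np

def spatial_variance(F, eps=1e-12):
    """Eq. (5): spatial variance of a feature map F around its
    response-weighted centre (Eqs. (6) and (7))."""
    h, w = F.shape
    ii, jj = np.mgrid[0:h, 0:w]
    total = F.sum() + eps
    Ei = (ii * F).sum() / total                                    # Eq. (6)
    Ej = (jj * F).sum() / total                                    # Eq. (7)
    return (((ii - Ei) ** 2 + (jj - Ej) ** 2) * F).sum() / total   # Eq. (5)

def compactness_weights(feature_maps, eps=1e-12):
    """Eq. (8): beta_k = 1 / exp(v_k), computed on normalised spatial variances."""
    v = np.array([spatial_variance(F, eps) for F in feature_maps])
    v = v / (v.max() + eps)            # normalisation across feature maps (assumed form)
    return 1.0 / np.exp(v)
```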
Based on compactness property of feature maps calculated
in Eq. (8), we fuse the feature maps for the saliency map as
follows.
$$S_f = \sum_k \beta_k \cdot F_k + \sum_{p \neq q} \beta_p \cdot \beta_q \cdot F_p \cdot F_q \qquad (9)$$
The first term in Eq. (9) represents the linear combination
of feature maps weighted by corresponding compactness prop-
erties of feature maps; while the second term is adopted to

Citations
More filters
Proceedings ArticleDOI
15 Jun 2019
TL;DR: Contrast prior is utilized, which used to be a dominant cue in none deep learning based SOD approaches, into CNNs-based architecture to enhance the depth information and is integrated with RGB features for SOD, using a novel fluid pyramid integration.
Abstract: The large availability of depth sensors provides valuable complementary information for salient object detection (SOD) in RGBD images. However, due to the inherent difference between RGB and depth information, extracting features from the depth channel using ImageNet pre-trained backbone models and fusing them with RGB features directly are sub-optimal. In this paper, we utilize contrast prior, which used to be a dominant cue in none deep learning based SOD approaches, into CNNs-based architecture to enhance the depth information. The enhanced depth cues are further integrated with RGB features for SOD, using a novel fluid pyramid integration, which can make better use of multi-scale cross-modal features. Comprehensive experiments on 5 challenging benchmark datasets demonstrate the superiority of the architecture CPFP over 9 state-of-the-art alternative methods.

385 citations

Journal ArticleDOI
TL;DR: A novel framework based on convolutional neural networks (CNNs), which transfers the structure of the RGB-based deep neural network to be applicable for depth view and fuses the deep representations of both views automatically to obtain the final saliency map is proposed.
Abstract: Salient object detection from RGB-D images aims to utilize both the depth view and RGB view to automatically localize objects of human interest in the scene. Although a few earlier efforts have been devoted to the study of this paper in recent years, two major challenges still remain: 1) how to leverage the depth view effectively to model the depth-induced saliency and 2) how to implement an optimal combination of the RGB view and depth view, which can make full use of complementary information among them. To address these two challenges, this paper proposes a novel framework based on convolutional neural networks (CNNs), which transfers the structure of the RGB-based deep neural network to be applicable for depth view and fuses the deep representations of both views automatically to obtain the final saliency map. In the proposed framework, the first challenge is modeled as a cross-view transfer problem and addressed by using the task-relevant initialization and adding deep supervision in hidden layer. The second challenge is addressed by a multiview CNN fusion model through a combination layer connecting the representation layers of RGB view and depth view. Comprehensive experiments on four benchmark datasets demonstrate the significant and consistent improvements of the proposed approach over other state-of-the-art methods.

364 citations


Cites methods from "Saliency Detection for Stereoscopic..."

  • ...Linear weighted fusion including summation (SUM) [8], [25] and multiplication (MUL) [7] are two of the simplest and most widely used fusion methods....


Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper reviewed different types of saliency detection algorithms, summarize the important issues of the existing methods, and discuss the existent problems and future works, and the experimental analysis and discussion are conducted to provide a holistic overview of different saliency detectors.
Abstract: The visual saliency detection model simulates the human visual system to perceive the scene and has been widely used in many vision tasks. With the development of acquisition technology, more comprehensive information, such as depth cue, inter-image correspondence, or temporal relationship, is available to extend image saliency detection to RGBD saliency detection, co-saliency detection, or video saliency detection. The RGBD saliency detection model focuses on extracting the salient regions from RGBD images by combining the depth information. The co-saliency detection model introduces the inter-image correspondence constraint to discover the common salient object in an image group. The goal of the video saliency detection model is to locate the motion-related salient object in video sequences, which considers the motion cue and spatiotemporal constraint jointly. In this paper, we review different types of saliency detection algorithms, summarize the important issues of the existing methods, and discuss the existent problems and future works. Moreover, the evaluation datasets and quantitative measurements are briefly introduced, and the experimental analysis and discussion are conducted to provide a holistic overview of different saliency detection methods.

328 citations

Journal ArticleDOI
TL;DR: A novel multi-scale multi-path fusion network with cross-modal interactions (MMCI), in which the traditional two-stream fusion architecture with single fusion path is advanced by diversifying the fusion path to a global reasoning one and another local capturing one and meanwhile introducing cross- modal interactions in multiple layers.

310 citations

Proceedings ArticleDOI
01 Jun 2018
TL;DR: A novel complementarity-aware fusion (CA-Fuse) module when adopting the Convolutional Neural Network (CNN) and the proposed RGB-D fusion network disambiguates both cross-modal and cross-level fusion processes and enables more sufficient fusion results.
Abstract: How to incorporate cross-modal complementarity sufficiently is the cornerstone question for RGB-D salient object detection. Previous works mainly address this issue by simply concatenating multi-modal features or combining unimodal predictions. In this paper, we answer this question from two perspectives: (1) We argue that if the complementary part can be modelled more explicitly, the cross-modal complement is likely to be better captured. To this end, we design a novel complementarity-aware fusion (CA-Fuse) module when adopting the Convolutional Neural Network (CNN). By introducing cross-modal residual functions and complementarity-aware supervisions in each CA-Fuse module, the problem of learning complementary information from the paired modality is explicitly posed as asymptotically approximating the residual function. (2) Exploring the complement across all the levels. By cascading the CA-Fuse module and adding level-wise supervision from deep to shallow densely, the cross-level complement can be selected and combined progressively. The proposed RGB-D fusion network disambiguates both cross-modal and cross-level fusion processes and enables more sufficient fusion results. The experiments on public datasets show the effectiveness of the proposed CA-Fuse module and the RGB-D salient object detection network.

303 citations


Cites methods from "Saliency Detection for Stereoscopic..."

  • ...Result fusion methods include summation [35, 42], multiplication [31] and designed rules [33]....


References
More filters
Journal ArticleDOI
TL;DR: A new hypothesis about the role of focused attention is proposed, which offers a new set of criteria for distinguishing separable from integral features and a new rationale for predicting which tasks will show attention limits and which will not.

11,452 citations


"Saliency Detection for Stereoscopic..." refers background or methods in this paper

  • ...According to the Feature Integration Theory (FIT) [13], the early selective attention causes some image regions to be salient due to their different features (such as color, intensity, texture, depth, etc....


  • ...According to FIT [13], the salient regions in visual scenes will pop out due to their feature contrast from their surrounding regions....


  • ...Based on the FIT, many bottom-up saliency detection models have been proposed for 2D images/videos recently [1]-[6]....


  • ...According to the Feature Integration Theory (FIT) [13], the early selective attention causes some image regions to be salient due to their different features (such as color, intensity, texture, depth, etc.) from their surrounding regions....


Journal ArticleDOI
TL;DR: In this article, a visual attention system inspired by the behavior and the neuronal architecture of the early primate visual system is presented, where multiscale image features are combined into a single topographical saliency map.
Abstract: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented. Multiscale image features are combined into a single topographical saliency map. A dynamical neural network then selects attended locations in order of decreasing saliency. The system breaks down the complex problem of scene understanding by rapidly selecting, in a computationally efficient manner, conspicuous locations to be analyzed in detail.

10,525 citations

01 Jan 1998
TL;DR: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented, which breaks down the complex problem of scene understanding by rapidly selecting conspicuous locations to be analyzed in detail.

8,566 citations


"Saliency Detection for Stereoscopic..." refers background or methods in this paper

  • ...proposed one of the earliest computational saliency detection model based on the neuronal architecture of the primates’ early visual system [1]....


  • ...Bottom-up approach, which is data-driven and task-independent, is a perception process for automatic salient region selection for natural scenes [1]-[7], while topdown approach is a task-dependent cognitive processing affected by the performed tasks, feature distribution of targets, and so on [8]-[9]....


  • ...Based on the FIT, many bottom-up saliency detection models have been proposed for 2D images/videos recently [1]-[6]....


  • ...In Table 1, Model 1 in [23] represents the fusion method from 2D saliency detection model in [1] and depth model in [23]; Table 1....


  • ...In contrast, the existing models in [23] which incorporate the 2D saliency methods [1, 2, 3] are designed for only bottom-up mechanis-...


Proceedings ArticleDOI
20 Jun 2011
TL;DR: This work proposes a regional contrast based saliency extraction algorithm, which simultaneously evaluates global contrast differences and spatial coherence, and consistently outperformed existing saliency detection methods.
Abstract: Automatic estimation of salient object regions across images, without any prior assumption or knowledge of the contents of the corresponding scenes, enhances many computer vision and computer graphics applications. We introduce a regional contrast based salient object detection algorithm, which simultaneously evaluates global contrast differences and spatial weighted coherence scores. The proposed algorithm is simple, efficient, naturally multi-scale, and produces full-resolution, high-quality saliency maps. These saliency maps are further used to initialize a novel iterative version of GrabCut, namely SaliencyCut, for high quality unsupervised salient object segmentation. We extensively evaluated our algorithm using traditional salient object detection datasets, as well as a more challenging Internet image dataset. Our experimental results demonstrate that our algorithm consistently outperforms 15 existing salient object detection and segmentation methods, yielding higher precision and better recall rates. We also show that our algorithm can be used to efficiently extract salient object masks from Internet images, enabling effective sketch-based image retrieval (SBIR) via simple shape comparisons. Despite such noisy internet images, where the saliency regions are ambiguous, our saliency guided image retrieval achieves a superior retrieval rate compared with state-of-the-art SBIR methods, and additionally provides important target object region information.

3,653 citations


"Saliency Detection for Stereoscopic..." refers background in this paper

  • ...explored the saliency analysis for stereoscopic images by extending a 2D image saliency detection model [25]....


Proceedings Article
04 Dec 2006
TL;DR: A new bottom-up visual saliency model, Graph-Based Visual Saliency (GBVS), is proposed, which powerfully predicts human fixations on 749 variations of 108 natural images, achieving 98% of the ROC area of a human-based control, whereas the classical algorithms of Itti & Koch achieve only 84%.
Abstract: A new bottom-up visual saliency model, Graph-Based Visual Saliency (GBVS), is proposed. It consists of two steps: first forming activation maps on certain feature channels, and then normalizing them in a way which highlights conspicuity and admits combination with other maps. The model is simple, and biologically plausible insofar as it is naturally parallelized. This model powerfully predicts human fixations on 749 variations of 108 natural images, achieving 98% of the ROC area of a human-based control, whereas the classical algorithms of Itti & Koch ([2], [3], [4]) achieve only 84%.

3,475 citations


"Saliency Detection for Stereoscopic..." refers background or methods in this paper

  • ...extended Itti’s model by using a more accurate measure of dissimilarity [2]....


  • ...GBVS [2], AIM [3], FT [4], ICL [47], LSK [48],...


Frequently Asked Questions (10)
Q1. What contributions have the authors mentioned in the paper "Saliency detection for stereoscopic images"?

In this paper, the authors propose a novel stereoscopic saliency detection framework based on the feature contrast of color, luminance, texture, and depth. 

With the enhancement operation by the centre bias factor, the saliency values of center regions in images would increase, while with the enhancement operation by human visual acuity, the saliency values of non-salient regions in natural scenes would decrease and the saliency map would get visually better. 

Since texture feature is represented as 9 low-frequency AC coefficients, the authors calculate the feature contrast from texture by the L2 norm. 

Another method of 3D saliency detection model is built by incorporating depth saliency map into the traditional 2D saliency detection methods. 

Similar with feature extraction for color and luminance, the authors adopt the DC coefficients of patches in depth map calculated in Eq. (1) as D = MDC (MDC represents the DC coefficient of the image patch in depth map M). 

Based on the study [35], the authors use the first 9 low-frequency AC coefficients to represent the texture feature for each image patch as T = {YAC1, YAC2, . . . , YAC9}. 

Among these measures, PLCC and KLD are calculated directly from the comparison between the fixation density map and the predicted saliency map, while AUC is computed from the comparison between the actual gaze points and the predicted saliency map.

Compared with feature maps from these low-level features of color, luminance, texture and depth, the final saliency map calculated from the proposed fusion method can get much better performance for saliency estimation for 3D images, as shown by the PLCC, KLD and AUC values in Table I. 

The retinal eccentricity e between a salient pixel and a non-salient pixel can be computed according to its relationship with the spatial distance between image pixels.

The proposed 3D saliency detection model can obtain promising performance for saliency estimation for 3D images, as shown in the experiment section.