Journal ArticleDOI

Background Prior-Based Salient Object Detection via Deep Reconstruction Residual

TL;DR: A novel framework for saliency detection is proposed that first models the background with stacked denoising autoencoders built on deep learning architectures, and then separates salient objects from the background.
Abstract: Detection of salient objects from images is gaining increasing research interest in recent years as it can substantially facilitate a wide range of content-based multimedia applications. Based on the assumption that foreground salient regions are distinctive within a certain context, most conventional approaches rely on a number of hand-designed features and their distinctiveness is measured using local or global contrast. Although these approaches have been shown to be effective in dealing with simple images, their limited capability may cause difficulties when dealing with more complicated images. This paper proposes a novel framework for saliency detection by first modeling the background and then separating salient objects from the background. We develop stacked denoising autoencoders with deep learning architectures to model the background where latent patterns are explored and more powerful representations of data are learned in an unsupervised and bottom-up manner. Afterward, we formulate the separation of salient objects from the background as a problem of measuring reconstruction residuals of deep autoencoders. Comprehensive evaluations on three benchmark datasets and comparisons with nine state-of-the-art algorithms demonstrate the superiority of the proposed approach.

Summary (4 min read)

Introduction

  • A few recent approaches tried to learn better representations from natural scenes for saliency detection by using independent component analysis (ICA) [8], sparse coding [9, 10], and low-rank matrix recovery [11].
  • To be specific, in [15] and [16] the global contrast is derived in the frequency domain with the hypothesis that salient regions are normally less frequent.
  • They represent the image as a close-loop graph with superpixels as nodes.
  • Fig. 2 illustrates the workflow of the proposed framework.

II. THE PROPOSED APPROACH

  • The authors discuss the proposed method for salient object detection in detail.
  • It includes three subsections, which in turn introduce SDAE, the proposed saliency detection framework, and two useful post-processing steps, respectively.

A. Stacked Denoising Autoencoder (SDAE)

  • Autoencoders are simple learning neural networks which aim to transform inputs into outputs with the least possible amount of distortion for learning latent patterns of the given data.
  • Specifically, it includes an encoding process and a decoding process.
  • Usually, training a DAE is straightforward, where the back-propagation algorithm can be used to compute the gradient of the objective function [26, 27], and the same target activation function can be used in all the layers when training SDAE.

B. Saliency Detection via Deep Reconstruction Residual

  • As the authors mentioned in Section I, local and global contrast-based methods lack the ability to precisely compute the contrast between foreground objects and the background.
  • The authors follow the basic rule of photographic composition and assume that the image boundary is mostly background.
  • Specifically, the authors separately define four boundaries for each image, as shown in the side-specific SDAE training stage of Fig. 2.
  • Finally, the four residual maps are linearly combined to generate the saliency map: $S_R = (R_{top} + R_{bottom} + R_{left} + R_{right})/4$ (12).
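A minimal sketch of this combination step (Eq. 12), assuming the four side-specific residual maps have already been computed at a common resolution; the final min-max normalization is a common convention rather than something the summary specifies:

```python
import numpy as np

def combine_residual_maps(r_top, r_bottom, r_left, r_right):
    """Average the four side-specific reconstruction residual maps
    into a single saliency map S_R, as in Eq. (12)."""
    s_r = (r_top + r_bottom + r_left + r_right) / 4.0
    # Rescale to [0, 1] for thresholding/visualization (assumed step).
    return (s_r - s_r.min()) / (s_r.max() - s_r.min() + 1e-12)

# Toy usage with random stand-in residual maps.
maps = [np.random.rand(60, 80) for _ in range(4)]
saliency = combine_residual_maps(*maps)
```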

C. Post Processing

  • As discussed above, the authors compute the saliency map $S_R$ at five different image scales to account for scale changes in salient objects.
  • To integrate salient regions at different scales, the authors use the average value of the five single-scale saliency maps to generate the multi-scale integrated saliency map $S_R$.
  • To further refine the results, two post-processing steps are adopted on the basis of the image organization priors and the region property, as presented in detail below.

1) Image organization refinement

  • According to the visual organization rules in [33], these cases can be refined by considering the visual contextual effect.
  • In the first component, following the suggestion of [34] that salient pixels tend to group together, as they typically correspond to real objects in the scene, the authors first apply a self-adaptive threshold $t = \mathrm{mean}(S_R)$ to obtain the salient cluster.
  • In the second component, to deal with the case where the highlighted regions omit parts of the real foreground, the authors follow [35] and include the immediate context by weighting the saliency value of each pixel based on its distance to the high-salient pixel locations.
  • To encode immediate context information, the high-salient pixel locations $\Phi = \{S_R > t\}$ are found, and the saliency values at all pixel locations are weighted by their distance to $\Phi$.
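A sketch of these two refinement components: the self-adaptive threshold $t = \mathrm{mean}(S_R)$ and the distance weighting follow the text, while the Gaussian falloff and the value of `sigma` are illustrative assumptions, since the summary only says saliency is weighted by distance to $\Phi$:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def organization_refine(s_r, sigma=10.0):
    """Image-organization refinement: find the salient cluster with a
    self-adaptive threshold, then attenuate saliency with distance to it."""
    t = s_r.mean()                    # self-adaptive threshold t = mean(S_R)
    phi = s_r > t                     # high-salient pixel locations (Phi)
    d = distance_transform_edt(~phi)  # distance of each pixel to nearest Phi pixel
    # Gaussian falloff is an assumed choice of distance weighting.
    return s_r * np.exp(-(d ** 2) / (2 * sigma ** 2))

refined = organization_refine(np.random.rand(60, 80))
```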

2) Region smoothing

  • In order to highlight the entire salient object uniformly and recover more edge information, inspired by [35], the authors refine the saliency of each pixel using the region information.
  • Specifically, a graph based segmentation algorithm [36] is used to decompose the image into a number of small regions and the final saliency of each region is calculated by the average saliency value of all the pixels within it.
  • Examples of region smoothing results are shown in the fifth column of Fig. 4.
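A sketch of the region-smoothing step, assuming a precomputed integer label map from the graph-based segmentation of [36] (not reimplemented here):

```python
import numpy as np

def region_smooth(s_r, labels):
    """Assign every pixel the mean saliency of its segment, so whole
    regions are highlighted uniformly."""
    out = np.empty_like(s_r)
    for seg in np.unique(labels):
        mask = labels == seg
        out[mask] = s_r[mask].mean()
    return out

# Toy usage: a saliency map with a two-region label map.
s = np.random.rand(4, 4)
lab = np.zeros((4, 4), dtype=int)
lab[:, 2:] = 1
smoothed = region_smooth(s, lab)
```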

III. EXPERIMENTS

  • To evaluate the performance of the proposed salient object detection framework, the authors compared it with 9 state-of-the-art approaches, all published within the last three years in top journals or conferences.
  • To obtain the performance of these 9 methods, the authors adopted either the author-provided implementations or author-provided saliency maps.
  • To the best of their knowledge, this dataset is one of the largest test sets for salient object detection whose ground truth is in the form of manually labeled accurate object contours instead of rough bounding boxes.
  • It can be observed that, compared with PD, GBMR, GS-S, GS-G, BLSM, and CNTX, the proposed method can highlight salient regions more uniformly.

A. Evaluation Metrics

  • Following previous works [9, 12, 15, 16, 34, 41-43], four metrics are adopted in the experiments to quantitatively measure the performance of the saliency maps: the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), the precision-recall (PR) curve, and the average precision (AP).
  • Observing the Gaussian-like distributions of the saliency values in the proposed saliency maps, an adaptive threshold $T = \mu + \sigma$, as suggested in [44], is used to segment the saliency maps.
  • For each segmented foreground binary map $SF_T$ under the adaptive threshold $T$, the authors follow [51] and evaluate it using the weighted F-measure.
  • To take into consideration both the dependency between pixels and the location of the errors, a weighting function is applied to the errors as $E^w = \min(E, E\mathbf{A}) \cdot \mathbf{B}$.
  • Then, the weighted true positives $TP^w$, weighted false positives $FP^w$, and weighted false negatives $FN^w$ can be calculated accordingly.
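A minimal sketch of the adaptive thresholding step $T = \mu + \sigma$ from [44]; the weighted F-measure machinery of [51] (the matrices $\mathbf{A}$ and $\mathbf{B}$) is not reproduced here, only the segmentation that feeds it:

```python
import numpy as np

def adaptive_segment(s_r):
    """Binarize a saliency map with the adaptive threshold T = mu + sigma,
    yielding the foreground binary map SF_T."""
    t = s_r.mean() + s_r.std()
    return (s_r >= t).astype(np.uint8)

binary = adaptive_segment(np.random.rand(60, 80))
```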

B. Parameters Analysis and Model Evaluation

  • The authors analyze the effect of a few key parameters in the proposed model on performance.
  • Here the authors conducted the evaluation on the SOD and SED datasets.
  • Some examples of the experimental results obtained under different $\beta$ are also given in Fig. 7.
  • From the second and the third columns of Fig. 7, it can be seen that for images with cluttered backgrounds, sparsity is an essential element for suppressing the saliency of the background regions.
  • A similar phenomenon is also observed in [48, 49].

C. Evaluations on the ASD Dataset

  • The authors conducted quantitative comparisons with state-of-the-art methods on the ASD dataset using ROC, PR, AUC, AP, and the weighted performance metric.
  • The ROC results show that the proposed method achieves the highest true positive rate when the false positive rate is between about 0.05 and 1; as a result, the proposed method outperforms the other 9 algorithms in terms of ROC and AUC.
  • The statistical results reflect the distributions of the true salient pixels and true background pixels over the calculated saliency values.

D. Evaluations on the SOD Dataset

  • The authors also conducted comparisons with state-of-the-art methods on the more challenging SOD dataset.
  • All the comparison results, including ROC, AUC, PR, AP, and weighted F-measure, are shown in Figs. 13-15.
  • From Fig. 15, it is observed that the proposed approach achieves the highest weighted F-measure; the figure also shows that the weighted recall values of most of the state-of-the-art methods are less than 0.6, whereas the proposed approach achieves the highest weighted recall value, around 0.64, which indicates that the proposed method tends to highlight the entire salient objects.
  • For the foreground distribution and background distribution, similar observations can be found in comparison of results obtained from different approaches.
  • As shown in Fig. 16, the distributions on the SOD dataset degrade noticeably.

E. Evaluations on the SED dataset

  • The proposed approach was also tested on the SED database, another challenging dataset.
  • As GS-S and GS provided their codes and results on this dataset, the authors were able to include them in the comparison.
  • More encouragingly, compared with other state-of-the-art methods, the proposed method achieves the higher true positive rates along the whole ROC curve, and the higher precision values along almost the whole PR curve as well.
  • Similar to the SOD dataset, the SED dataset also contains a large number of images with complicated content and multiple salient objects.
  • The experimental results show that the proposed algorithm has a more powerful capability to handle these challenges.

F. Running time

  • Table II lists the average execution time in processing an image of size 400×300 by using different approaches.
  • For the implementation of the proposed method, the authors used the parallel computing toolbox of MATLAB and executed the code on an NVIDIA GeForce GTX Titan Black GPU.
  • For other state-of-the-art approaches, the authors used the source codes provided by their authors.
  • The authors did not compare with GS because the corresponding codes have not been released by the authors.
  • As can be seen, the proposed algorithm has moderate computational complexity.

IV. CONCLUSION

  • The authors have proposed a bottom-up salient object detection framework based on the background prior.
  • The superiority of the proposed algorithm over existing approaches is twofold.
  • First, instead of using traditional hand-designed features, the proposed algorithm adopted SDAE with deep structures to learn more powerful representations for saliency computation.
  • Second, the proposed work cast the separation of salient objects from the background as a problem of calculating the reconstruction residual of the SDAE.
  • For future work, the authors intend to extend the proposed work in the following directions.


Index Terms—salient object detection, stacked denoising
autoencoder, background prior, deep reconstruction residual.
I. INTRODUCTION

Salient object detection, aiming to discover the most important and informative parts in an image, is gaining
intensive research attention recently as it can serve as a base for
a large number of multimedia applications such as image
resizing, image montage, action analysis and visual recognition
[1-4]. Based on the underlying hypothesis that the salient
stimulus is distinct from its contextual stimuli, most existing
saliency detection models need to solve two key problems: i)
extract effective features to represent the image and, ii) develop
an optimal mechanism to measure the distinctiveness over the
extracted features.
The performance of saliency detection models heavily relies on the features (data representations) being used.

[Footnote: Manuscript received April 14, 2014. This work was partially supported by the National Science Foundation of China under Grants 61103061, 91120005, and 61473231. Junwei Han, Dingwen Zhang, Xintao Hu, and Lei Guo are with the School of Automation, Northwestern Polytechnical University, Xi'an, China (phone and fax: 86-29-88431318; e-mail: junweihan2010@gmail.com). Jinchang Ren is with the Department of Electronic and Electrical Engineering, University of Strathclyde, UK. Feng Wu is with the School of Information Science, University of Science and Technology of China.]

In the last 15
years, a variety of features have been proposed for the task of
image saliency detection. The earliest saliency computation
model by Itti et al. [5] proposed three biologically plausible
features including color, intensity, and orientation. In Judd et al.
[6], besides Itti's three features, several new features were
introduced to characterize image content, which include the
local energy of the steerable pyramid filters, subband pyramids
based features, 3D color histogram, and horizon line detector.
As visual attention could be directed by specific objects, some
detectors of face, car, and person were treated as features for
detecting saliency [6, 7]. All these feature representations are
hand-designed and require significant amounts of domain
knowledge. However, hand-designed features in general suffer from
poor generalization capability for different images, especially
due to the lack of thorough understanding of the biological
mechanisms and principles of human visual attention as well as
weak human intuition involved. A few recent approaches tried
to learn better representations from natural scenes for saliency
detection by using independent component analysis (ICA) [8],
sparse coding [9, 10], and low-rank matrix recovery [11].
Nevertheless, due to the shallow-structured architectures used,
these methods still have limited representational power and are
insufficient to capture high-level information and latent
patterns of complex image data. To overcome such drawbacks,
in this paper, we investigate the feasibility of learning more
powerful representation directly from the raw image data itself
in an unsupervised way for the task of saliency detection.
The saliency or distinctiveness is typically measured by
image contrast computation over features, where various
contrast measures have been presented. Depending on the
extent of context in which the contrast is calculated, these
approaches can be classified into local-contrast based methods
and global-contrast based methods. Local-contrast based
methods estimate the saliency of an image pixel or an image
patch by calculating the contrast against its local neighborhood,
and some representative local methods include the
center-surround difference [5, 6, 12, 13], incremental coding
length [10], and self-resemblance [14]. Global-contrast based
methods characterize the saliency of an image region as the
uniqueness in the entire image. Previous studies have
proposed a variety of approaches to model the global contrast
from different perspectives. To be specific, in [15] and [16] the
global contrast is derived in the frequency domain with the
hypothesis that salient regions are normally less frequent. Han
et al. [9] and Zhang et al. [8] utilized the Gaussian models to
calculate the global contrast. Cheng et al. [17] proposed to
model the global contrast on the region level where each
region's contrast is generated by a weighted summation of the
differences between itself and all other regions. Shen et al. [11]
represented a whole image as a low-rank matrix with sparse
noises where sparse noises denote the salient regions.
In spite of extensive efforts, local and global contrast based
approaches still suffer from some drawbacks. First, these
approaches normally can only highlight object boundaries but
fail to detect the whole target region uniformly as shown in the
examples given in Fig. 1. This problem may be alleviated in
some global-contrast based methods while the results yielded
are still unsatisfactory. Second, although salient objects often present high contrast, the inverse might not necessarily be true [11]. In many complex images (as shown in the third example of Fig. 1), the background contains small-scale high-contrast patterns which may cause previous contrast-based methods to fail.
Essentially, the true aim of salient object detection is to find
objects that are distinctive from the image background. It needs
to calculate the contrast between the objects and the image
background and then select those with high contrast as the
salient objects. However, the local and global contrast-based
methods do not identify which regions form the image
background. They blindly assume the neighboring regions or
the entire image to be the background and then calculate the
contrast between each location and the assumed background.
As their assumed background may not be the real one, the
determined contrast also becomes incorrect, which in turn
reduces the performance of saliency detection. To overcome
these problems, a few emerging methods [18, 19] using
background priors were proposed based on the idea of
modeling the property of background first and thereby
separating salient objects from the background. Specifically, Wei
et al. [18] exploited the boundary and connectivity priors about
the background in natural images and detected saliency based
on the geodesic distance. Considering that the salient object
may be partially cropped on the boundary, this work adopts an
existing saliency detection method [33] to compute the saliency
of boundary patches and generates weights for the virtual
background nodes. However, in some challenging images
where the work [33] could not calculate the saliency of
boundary patches precisely, the method of [18] can hardly obtain satisfactory results. Yang et al. [19] modeled saliency
detection as a manifold ranking problem and proposed a
two-stage scheme for graph labelling. They represent the image
as a close-loop graph with superpixels as nodes. In saliency
detection, they first use the nodes on the image boundary as
background seeds to rank other nodes in the graph. Then, in the
second stage, they select the salient nodes from the detection
results of the first stage and use them to refine the saliency of
other nodes in the graph. On the assumption that the image
boundary is mostly background, these methods result in a
background template. As a result, the contrast between salient
object and background can be precisely obtained. By
incorporating background priors into traditional contrast-based
methods, they show improved results in saliency detection.
However, existing background prior based methods still
have certain limitations. Typically, there are four scenarios encountered when performing background prior based saliency detection, as summarized below.
1) The entire image boundary is a large and smoothly
connected region (see the first row of Fig. 1);
2) The regions defined within the image boundary look
different whereas they may share certain latent pattern (see the
second row of Fig. 1);
3) The background is complex (for example, containing
small-scale high-contrast patterns) and regions of image
boundary are different as shown in the third row of Fig. 1;
4) Salient objects significantly touch the image boundary and
parts of them are wrongly considered as background as shown
in the fourth row of Fig. 1.
As can be seen in Fig. 1, existing background prior based
approaches [18] are effective for the first scenario and
moderately effective for the second scenario. However,
unsatisfactory results are produced in dealing with the last two
scenarios. In this paper, we propose a novel background prior
based saliency detection framework using stacked denoising
autoencoder (SDAE) with deep learning architectures. In the
proposed work, SDAE is used to model image background.
Rather than adopting hand-designed features as used in
previous works [18, 19], the deep-structured SDAE is
employed to learn more powerful representation directly from
the raw image data in an unsupervised way, which also enables it to capture the latent pattern of the input data hierarchically. It
thus helps to deal with the second scenario (shown in the
second row of Fig. 1) where the background regions share
latent patterns. Then, the measure of contrast between salient
objects and the background is formulated as the reconstruction
residuals in the deep-structured SDAE. Different from the
previous works [18, 19] which mainly focused on the way to
calculate the similarity or distinctiveness between a certain
image patch and the image boundary, the proposed work pays
more attention to modeling the background regions.
Fig. 1. Some examples of saliency detection. (a) Input images. (b) Results
from one local contrast method [5]. (c) Results from one global contrast
method [15]. (d) Results from the background prior based method [18]. (e)
Results from the proposed method. (f) Ground truth salient object masks.

Specifically, the sparsity is considered when training SDAE
models, which is helpful to suppress the saliency of the
background regions. Therefore, it is robust in handling the third
scenario (shown in the third row of Fig. 1) where the most
challenging task is to avoid mis-highlighting the small-scale
high-contrast background regions in the saliency maps. In
addition, the learning process of SDAE with the usage of
stochastic corruption criteria is helpful to train a deep model for
better robustness and feature representation. Thus, the trained
robust SDAE shows promising performance in these scenarios.
Fig. 2 illustrates the workflow of the proposed framework.
First, we downsample the original image to multiple scales to
generate the multi-scale inputs. Afterwards, we explore the
background prior via SDAE and detect salient regions by deep
reconstruction residuals which can reflect the distinctness
between the background and salient regions. Finally, post-processing steps are applied to integrate the salient object detection
results for each scale of input and generate the final saliency
map by image organization refinement and region smoothing.
The rest of the paper is organized as follows. Section II
introduces the proposed approach in detail. Section III
presents experimental results with quantitative evaluation in
comparison with a group of state-of-the-art approaches. Finally,
several concluding remarks are drawn in Section IV.
II. THE PROPOSED APPROACH

In this section, we discuss the proposed method for salient object detection in detail. It includes three subsections, which in turn introduce SDAE, the proposed saliency detection
framework, and two useful post-processing steps, respectively.
A. Stacked Denoising Autoencoder (SDAE)
Autoencoders are simple learning neural networks which aim
to transform inputs into outputs with the least possible amount
of distortion for learning latent patterns of the given data. While
conceptually simple, they play an important role in machine
learning and feature representation. More recently,
autoencoders have taken center stage again in the “deep
architecture” approaches [20-23], where autoencoders are
stacked and pre-trained in an unsupervised fashion. These deep
architectures have been shown to lead to state-of-the-art results
on a number of classification and regression problems [24].
As a form of neural network, the classical autoencoder [24] is
an unsupervised learning algorithm that applies
back-propagation and sets the target values of the network
outputs to be equal to the inputs. Specifically, it includes an
encoding process and a decoding process. The encoding
process uses an encoding function $f_{\theta_f}(\mathbf{x}_i)$ to take a nonlinear mapping from the visible input vector $\mathbf{x}_i$ to a hidden representation vector $\mathbf{y}_i$ by using an affine transformation with a projection matrix $\mathbf{W}$ and a bias $\mathbf{b}$. Normally, the sigmoid function $\mathrm{sigm}(\eta) = 1/(1+\exp(-\eta))$ is used as the deterministic mapping as follows:

$$\mathbf{y}_i = f_{\theta_f}(\mathbf{x}_i) = \mathrm{sigm}(\mathbf{W}\mathbf{x}_i + \mathbf{b}) \qquad (1)$$
A decoding function $g_{\theta_g}(\mathbf{y}_i)$ is adopted to map the hidden representation $\mathbf{y}_i$ back to a reconstruction representation $\mathbf{z}_i$ through a similar transformation:

$$\mathbf{z}_i = g_{\theta_g}(\mathbf{y}_i) = \mathrm{sigm}(\mathbf{W}'\mathbf{y}_i + \mathbf{b}') \qquad (2)$$
After the decoding process, the obtained reconstruction is taken as a prediction of the input $\mathbf{x}_i$. The training of an autoencoder is to optimize the parameters $\theta_f = \{\mathbf{W}, \mathbf{b}\}$ and $\theta_g = \{\mathbf{W}', \mathbf{b}'\}$ by minimizing the mean-squared reconstruction error between the training data and their reconstructed data via:

$$\arg\min_{\theta_f, \theta_g} L(\mathbf{X}, \mathbf{Z}) \qquad (3)$$

$$L(\mathbf{X}, \mathbf{Z}) = \frac{1}{2} \sum_{i=1}^{m} \|\mathbf{x}_i - \mathbf{z}_i\|_2^2 \qquad (4)$$

where $\mathbf{X} = \{\mathbf{x}_i\}$ and $\mathbf{Z} = \{\mathbf{z}_i\}$, $i \in [1, m]$, denote all the training and reconstructed data, respectively.
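To make Eqs. (1)-(4) concrete, here is a minimal numpy sketch of the encode/decode pass and the reconstruction loss; the layer sizes and weight initialization are illustrative, and the back-propagation update is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(eta):
    """Sigmoid nonlinearity sigm(eta) = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + np.exp(-eta))

# Toy dimensions: 108-d inputs (6x6x3 patches, see Section II-B), 64 hidden units.
d_in, d_hid = 108, 64
W  = rng.normal(0.0, 0.01, (d_hid, d_in)); b  = np.zeros(d_hid)   # theta_f
Wp = rng.normal(0.0, 0.01, (d_in, d_hid)); bp = np.zeros(d_in)    # theta_g

def encode(x):   # Eq. (1): y = sigm(W x + b)
    return sigm(W @ x + b)

def decode(y):   # Eq. (2): z = sigm(W' y + b')
    return sigm(Wp @ y + bp)

def recon_loss(X):   # Eq. (4): L = 1/2 * sum_i ||x_i - z_i||^2
    return 0.5 * sum(np.sum((x - decode(encode(x))) ** 2) for x in X)

X = [rng.random(d_in) for _ in range(10)]
print(recon_loss(X))
```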
Stacked autoencoder (SAE) is a deep learning architecture of
the classical autoencoders, which is built by stacking additional
unsupervised feature learning layers, and can be trained using
greedy methods for each additional layer. Specifically, once the
first layer is trained, the hidden representation of the first layer
can be treated as the input of the second layer. As a result, any
number of the K layers in this deep architecture can be trained
effectively. This deep architecture allows SAE to learn more
complex mappings from the input to hidden representations and capture the latent patterns which reflect the most homogeneous property shared among the training data.
Fig. 2. The workflow of the proposed framework.

1051-8215 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCSVT.2014.2381471, IEEE Transactions on Circuits and Systems for Video Technology
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) <
5
The stacked denoising autoencoder (SDAE) [25] is an extension of the SAE. It builds a deep architecture by stacking multiple layers of the denoising autoencoder (DAE), which reconstructs the input from a corrupted and partially destroyed version of it. By introducing stochastic corruption to the training samples, SDAE can avoid over-fitting and learn better features, where non-trivial features are robust to input noise and useful for further tasks. For a two-layered SDAE, this is done by first corrupting the initial input $\mathbf{x}_i \in \mathbf{X}$ into $\tilde{\mathbf{x}}_i$ by using a stochastic mapping $\tilde{\mathbf{x}}_i = q_D(\tilde{\mathbf{x}}_i | \mathbf{x}_i)$. According to [24, 25], $q_D(\tilde{\mathbf{x}}_i | \mathbf{x}_i)$ is implemented by randomly selecting a fraction (10% in this paper) of the input data and forcing them to be zero. In the bottom layer, the corrupted input $\tilde{\mathbf{x}}_i$ is then mapped to a hidden representation $\mathbf{y}_i^1 = f_{\theta_f^1}(\tilde{\mathbf{x}}_i)$ from which we reconstruct $\mathbf{z}_i^1 = g_{\theta_g^1}(\mathbf{y}_i^1)$.

Once the bottom layer is trained, the hidden representation of the bottom layer $\mathbf{y}_i^1$ is henceforth used as the input of the second layer $\mathbf{x}_i^2$ to train a new denoising autoencoder as follows:

$$\tilde{\mathbf{x}}_i^2 = q_D(\tilde{\mathbf{x}}_i^2 | \mathbf{x}_i^2) \qquad (5)$$

$$\mathbf{y}_i^2 = f_{\theta_f^2}(\tilde{\mathbf{x}}_i^2) \qquad (6)$$

$$\mathbf{z}_i^2 = g_{\theta_g^2}(\mathbf{y}_i^2) \qquad (7)$$

Note that SDAE still minimizes the reconstruction loss between a clean input $\mathbf{X}$ and its reconstruction representation $\mathbf{Z}$. It thus forces the learning of a far more clever mapping than the identity, e.g. extracting useful features for denoising [25].
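A sketch of the stochastic corruption $q_D$ described above: a random 10% of the input entries are forced to zero, while the reconstruction loss is still computed against the clean input:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, frac=0.10):
    """Stochastic corruption q_D: zero out a random fraction (10% in the
    paper) of the entries of the input vector."""
    x_tilde = x.copy()
    idx = rng.choice(x.size, size=int(frac * x.size), replace=False)
    x_tilde[idx] = 0.0
    return x_tilde

x = rng.random(108)
x_tilde = corrupt(x)  # fed to the encoder; the loss compares z to the clean x
```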
Motivated by the physiological evidence that describing patterns with fewer active neurons minimizes the probability of destructive cross talk, a regularization term that penalizes a deviation of the expected activation of the hidden units (representation vector) from a fixed (low) level $\rho$ is applied to impose sparsity on the target activation function [26]. Taking a single-layer autoencoder as an example, the target activation function with the sparsity constraint can be written as:

$$\arg\min_{\theta_f, \theta_g} L_{sparsity}(\mathbf{X}, \mathbf{Z}, \hat{\rho}_j, \rho) \qquad (8)$$

$$L_{sparsity}(\mathbf{X}, \mathbf{Z}, \hat{\rho}_j, \rho) = L(\mathbf{X}, \mathbf{Z}) + \beta \sum_{j=1}^{N} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) \qquad (9)$$

$$\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j} \qquad (10)$$

where $\beta$ is the weight of the sparsity penalty, $N$ is the number of features in the weight matrix, $\rho$ is the target average activation of the hidden units, and $\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} [\mathbf{y}_i]_j$ is the average activation of the $j$th hidden unit over the $m$ training data. The Kullback-Leibler divergence $\mathrm{KL}(\cdot \| \cdot)$ provides the sparsity constraint. As in sparse coding, a non-redundant over-complete feature set is learned when $\rho$ is small. Here we set $\rho = 0.05$ as suggested in [26]. Usually,
training a DAE is straightforward, where the back-propagation
algorithm can be used to compute the gradient of the objective
function [26, 27], and the same target activation function can be
used in all the layers when training SDAE. As the labels of the
input data are not needed in the training process above, the
layer-wise training step is actually unsupervised.
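A sketch of the sparsity penalty of Eqs. (9)-(10) for a batch of hidden activations; the value of $\beta$ here is illustrative, since the paper treats it as a tunable weight (its effect is analyzed in Section III-B):

```python
import numpy as np

def kl_sparsity_penalty(Y, rho=0.05, beta=3.0):
    """Sparsity term of Eq. (9): beta * sum_j KL(rho || rho_hat_j), where
    rho_hat_j is the mean activation of hidden unit j over the batch Y (m x N)."""
    rho_hat = np.clip(Y.mean(axis=0), 1e-8, 1 - 1e-8)  # clip for numerical safety
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return beta * kl.sum()

Y = np.random.rand(32, 64)  # 32 samples, 64 hidden activations in (0, 1)
penalty = kl_sparsity_penalty(Y)
```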
B. Saliency Detection via Deep Reconstruction Residual
As we mentioned in Section I, local and global
contrast-based methods lack the ability to precisely compute
the contrast between foreground objects and the background.
Inspired by the success of [18], this paper develops the
framework along the pipeline of modeling the background and
thereby separating salient objects from the background. We
follow the basic rule of photographic composition and assume
that the image boundary is mostly background. Then, the
contrast between salient object and the background can be more
precisely obtained. Specifically, we separately define four
boundaries for each image as shown in side-specific SDAE
training of Fig. 2. The height of the two horizontal boundaries is ten percent of the image height and their width is the image width. Similarly, the width of the two vertical boundaries is ten percent of the image width and their height is the image height. To validate the assumption that the image boundary is mostly background, we computed the percentage of foreground pixels (labeled in the ground truth) within the defined image boundaries in two widely used databases (the SOD database [40] and the SED dataset [50]). The statistics show that, for most images, less than 10% of the pixels in the image boundary are foreground pixels, which demonstrates that our assumption is reasonable. For the small number of foreground patches, the learning process of SDAE can decrease their influence by minimizing the objective function with the reconstruction error term when modeling the background.
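A sketch of the four boundary definitions, under the stated assumption of ten-percent border strips; the function and key names are illustrative:

```python
import numpy as np

def boundary_regions(img, frac=0.10):
    """Cut the four border strips assumed to be mostly background: the
    horizontal strips span the full width at 10% of the height, the
    vertical strips span the full height at 10% of the width."""
    h, w = img.shape[:2]
    dh, dw = max(1, int(frac * h)), max(1, int(frac * w))
    return {
        "top": img[:dh, :],
        "bottom": img[-dh:, :],
        "left": img[:, :dw],
        "right": img[:, -dw:],
    }

sides = boundary_regions(np.random.rand(300, 400, 3))
```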
As shown in Fig. 2, the proposed framework mainly consists
of three components: multi-scale inputs generation, salient
region detection via deep reconstruction residual, and post
processing. According to [28, 29], scale is an important factor
for identifying objects of different sizes. Similar to [28], we use five scales, $\{\frac{1}{2}, \frac{1}{3}, \frac{1}{4}, \frac{1}{5}, \frac{1}{6}\}$ of the original image size, to generate multi-scale inputs. The detection is more sensitive to small objects at the large scales, whereas it is more likely to highlight the inner regions of large objects at the small scales.
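A sketch of the multi-scale input generation; nearest-neighbor sampling stands in for whatever resampler the authors actually used, which the text does not specify:

```python
import numpy as np

def multiscale_inputs(img):
    """Downsample an image to the five scales {1/2, 1/3, 1/4, 1/5, 1/6}
    of the original size via nearest-neighbor index sampling."""
    h, w = img.shape[:2]
    pyramid = []
    for s in (1/2, 1/3, 1/4, 1/5, 1/6):
        rows = np.linspace(0, h - 1, int(h * s)).astype(int)
        cols = np.linspace(0, w - 1, int(w * s)).astype(int)
        pyramid.append(img[np.ix_(rows, cols)])
    return pyramid

scales = multiscale_inputs(np.random.rand(300, 400))
```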
Afterwards, we model the background using the SDAEs described in the last subsection and then detect saliency by deep reconstruction residuals at each scale. Specifically, we construct four deep residual maps based on the four boundaries (side-specific deep reconstruction residual maps shown in Fig. 2) and integrate them for the final map, which is referred to as the separation/combination (SC) approach [19]. Specifically, each image boundary is divided into patches of 6×6 pixels with an overlap of 2 pixels in each direction. Afterwards, we establish the SDAE model with a visible (input) layer of 6×6×3 = 108 visible units and two hidden layers. According to

Citations
Journal ArticleDOI
TL;DR: A large-scale data set, termed “NWPU-RESISC45,” is proposed, which is a publicly available benchmark for REmote Sensing Image Scene Classification (RESISC), created by Northwestern Polytechnical University (NWPU).
Abstract: Remote sensing image scene classification plays an important role in a wide range of applications and hence has been receiving remarkable attention. During the past years, significant efforts have been made to develop various datasets or present a variety of approaches for scene classification from remote sensing images. However, a systematic review of the literature concerning datasets and methods for scene classification is still lacking. In addition, almost all existing datasets have a number of limitations, including the small scale of scene classes and the image numbers, the lack of image variations and diversity, and the saturation of accuracy. These limitations severely limit the development of new approaches especially deep learning-based methods. This paper first provides a comprehensive review of the recent progress. Then, we propose a large-scale dataset, termed "NWPU-RESISC45", which is a publicly available benchmark for REmote Sensing Image Scene Classification (RESISC), created by Northwestern Polytechnical University (NWPU). This dataset contains 31,500 images, covering 45 scene classes with 700 images in each class. The proposed NWPU-RESISC45 (i) is large-scale on the scene classes and the total image number, (ii) holds big variations in translation, spatial resolution, viewpoint, object pose, illumination, background, and occlusion, and (iii) has high within-class diversity and between-class similarity. The creation of this dataset will enable the community to develop and evaluate various data-driven algorithms. Finally, several representative methods are evaluated using the proposed dataset and the results are reported as a useful baseline for future research.

1,424 citations

Journal ArticleDOI
TL;DR: This paper proposes a novel and effective approach to learn a rotation-invariant CNN (RICNN) model for advancing the performance of object detection, which is achieved by introducing and learning a new rotation- Invariant layer on the basis of the existing CNN architectures.
Abstract: Object detection in very high resolution optical remote sensing images is a fundamental problem faced for remote sensing image analysis. Due to the advances of powerful feature representations, machine-learning-based object detection is receiving increasing attention. Although numerous feature representations exist, most of them are handcrafted or shallow-learning-based features. As the object detection task becomes more challenging, their description capability becomes limited or even impoverished. More recently, deep learning algorithms, especially convolutional neural networks (CNNs), have shown their much stronger feature representation power in computer vision. Despite the progress made in nature scene images, it is problematic to directly use the CNN feature for object detection in optical remote sensing images because it is difficult to effectively deal with the problem of object rotation variations. To address this problem, this paper proposes a novel and effective approach to learn a rotation-invariant CNN (RICNN) model for advancing the performance of object detection, which is achieved by introducing and learning a new rotation-invariant layer on the basis of the existing CNN architectures. However, different from the training of traditional CNN models that only optimizes the multinomial logistic regression objective, our RICNN model is trained by optimizing a new objective function via imposing a regularization constraint, which explicitly enforces the feature representations of the training samples before and after rotating to be mapped close to each other, hence achieving rotation invariance. To facilitate training, we first train the rotation-invariant layer and then domain-specifically fine-tune the whole RICNN network to further boost the performance. Comprehensive evaluations on a publicly available ten-class object detection data set demonstrate the effectiveness of the proposed method.

1,370 citations

Proceedings ArticleDOI
27 Jun 2016
TL;DR: Evaluations on four benchmark datasets and comparisons with other 11 state-of-the-art algorithms demonstrate that DHSNet not only shows its significant superiority in terms of performance, but also achieves a real-time speed of 23 FPS on modern GPUs.
Abstract: Traditional1 salient object detection models often use hand-crafted features to formulate contrast and various prior knowledge, and then combine them artificially. In this work, we propose a novel end-to-end deep hierarchical saliency network (DHSNet) based on convolutional neural networks for detecting salient objects. DHSNet first makes a coarse global prediction by automatically learning various global structured saliency cues, including global contrast, objectness, compactness, and their optimal combination. Then a novel hierarchical recurrent convolutional neural network (HRCNN) is adopted to further hierarchically and progressively refine the details of saliency maps step by step via integrating local context information. The whole architecture works in a global to local and coarse to fine manner. DHSNet is directly trained using whole images and corresponding ground truth saliency masks. When testing, saliency maps can be generated by directly and efficiently feedforwarding testing images through the network, without relying on any other techniques. Evaluations on four benchmark datasets and comparisons with other 11 state-of-the-art algorithms demonstrate that DHSNet not only shows its significant superiority in terms of performance, but also achieves a real-time speed of 23 FPS on modern GPUs.

770 citations


Cites background or methods from "Background Prior-Based Salient Obje..."

  • ...Background prior [9-11] hypothesizes that regions near image boundaries are probably backgrounds....


  • ..., superpixels used in [9-11, 16-19] and object proposals used in [14]) either as the basic computational units to predict saliency or as the post-processing methods to smooth saliency maps....



Journal ArticleDOI
TL;DR: The underlying relationship among OD, SOD, and COD is revealed and some open questions are discussed as well as several unsolved challenges and promising future works are pointed out.
Abstract: Object detection, including objectness detection (OD), salient object detection (SOD), and category-specific object detection (COD), is one of the most fundamental yet challenging problems in the computer vision community. Over the last several decades, great efforts have been made by researchers to tackle this problem, due to its broad range of applications for other computer vision tasks such as activity or event recognition, content-based image retrieval and scene understanding, etc. While numerous methods have been presented in recent years, a comprehensive review for the proposed high-quality object detection techniques, especially for those based on advanced deep-learning techniques, is still lacking. To this end, this article delves into the recent progress in this research field, including 1) definitions, motivations, and tasks of each subdirection; 2) modern techniques and essential research trends; 3) benchmark data sets and evaluation metrics; and 4) comparisons and analysis of the experimental results. More importantly, we will reveal the underlying relationship among OD, SOD, and COD and discuss in detail some open questions as well as point out several unsolved challenges and promising future works.

564 citations


Cites background from "Background Prior-Based Salient Obje..."

  • ...One of the earliest pioneering works is [45], where Han et al....


References
Journal ArticleDOI
28 Jul 2006-Science
TL;DR: In this article, an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data is described.
Abstract: High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

16,717 citations

Journal ArticleDOI
TL;DR: Recent work in the area of unsupervised feature learning and deep learning is reviewed, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks.
Abstract: The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.

11,201 citations


"Background Prior-Based Salient Obje..." refers background in this paper

  • ...These deep architectures have been shown to lead to state-of-the-art results on a number of classification and regression problems [24]....


  • ...As a form of neural network, the classical autoencoder [24] is an unsupervised learning algorithm that applies backpropagation and sets the target values of the network outputs to be equal to the inputs....


  • ...According to [24] and [25], x̃i = q D(x̃i|xi) is implemented by randomly selecting a fraction (10% in this paper) of the input data and forcing them to be zero....


Journal ArticleDOI
TL;DR: In this article, a visual attention system inspired by the behavior and the neuronal architecture of the early primate visual system is presented, where multiscale image features are combined into a single topographical saliency map.
Abstract: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented. Multiscale image features are combined into a single topographical saliency map. A dynamical neural network then selects attended locations in order of decreasing saliency. The system breaks down the complex problem of scene understanding by rapidly selecting, in a computationally efficient manner, conspicuous locations to be analyzed in detail.

10,525 citations



"Background Prior-Based Salient Obje..." refers background or methods in this paper

  • ...(b) Results from one local contrast method [5]....


  • ...the center-surround difference [5], [6], [12], [13], incremental coding length [10], and self-resemblance [14]....


  • ...[5] proposed three biological plausible features including color, intensity, and orientation....


Journal ArticleDOI
TL;DR: An efficient segmentation algorithm is developed based on a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image and it is shown that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties.
Abstract: This paper addresses the problem of segmenting an image into regions. We define a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image. We then develop an efficient segmentation algorithm based on this predicate, and show that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties. We apply the algorithm to image segmentation using two different kinds of local neighborhoods in constructing the graph, and illustrate the results with both real and synthetic images. The algorithm runs in time nearly linear in the number of graph edges and is also fast in practice. An important characteristic of the method is its ability to preserve detail in low-variability image regions while ignoring detail in high-variability regions.

5,791 citations


"Background Prior-Based Salient Obje..." refers methods in this paper

  • ...Specifically, a graph-based segmentation algorithm [36] is used to decompose the image into a number...


Frequently Asked Questions (15)
Q1. What are the contributions in this paper?

This paper proposes a novel framework for saliency detection by first modeling the background and then separating salient objects from the background. The authors develop stacked denoising autoencoders with deep learning architectures to model the background, where latent patterns are explored and more powerful representations of data are learned in an unsupervised and bottom-up manner.

For future work, the authors intend to extend the proposed work in the following directions; among them, the proposed method can be extended to saliency detection in dynamic videos and many other applications such as image retrieval, image categorization, and image collection visualization.

As a form of neural network, the classical autoencoder [24] is an unsupervised learning algorithm that applies back-propagation and sets the target values of the network outputs to be equal to the inputs. 

Training a DAE is straightforward: the back-propagation algorithm can be used to compute the gradient of the objective function [26, 27], and the same target activation function can be used in all the layers when training SDAE.

In order to take into consideration both the dependency between pixels and the location of the errors, a weighting function is applied to the errors as $E^w = \min(E, E\mathbf{A}) \cdot \mathbf{B}$.

After normalization, the deep reconstruction residual maps $R_{top}$, $R_{bottom}$, $R_{left}$, and $R_{right}$ are obtained based on the SDAE models for the top, bottom, left, and right image boundary subsets, respectively.

For the small number of foreground patches, the learning process of SDAE could decrease their influence by minimizing the objective function with the reconstruction error term when modeling the background. 

Autoencoders are simple learning neural networks which aim to transform inputs into outputs with the least possible amount of distortion for learning latent patterns of the given data. 

The proposed method can be extended to saliency detection in dynamic videos and many other applications such as image retrieval, image categorization, and image collection visualization.

The proposed work cast the separation of salient objects from the background as a problem of calculating the reconstruction residual of the SDAE.

If the sparsity constraint is set too large, it normally leads to less stable and discontinuous detection results (as shown in the fourth column of Fig. 7).

To the best of their knowledge, this dataset is one of the largest test sets for salient object detection whose ground truth is in the form of manually labeled accurate object contours instead of rough bounding boxes.

The subjective evaluations by comparing with the ground truth suggest that the proposed method can yield saliency maps correctly and robustly in all three datasets. 

From Fig. 15, it is observed that the proposed approach achieves the highest weighted F-measure; the figure also shows that the weighted recall values of most of the state-of-the-art methods are less than 0.6, whereas the proposed approach achieves the highest weighted recall value, around 0.64, which indicates that the proposed method tends to highlight the entire salient objects.

As defined in [51], the matrix $\mathbf{A}$ captures the dependency between foreground pixels based on the Euclidean distance and the matrix $\mathbf{B}$ assigns importance weights to false detections according to their distance from the foreground.