Journal ArticleDOI

Background Prior-Based Salient Object Detection via Deep Reconstruction Residual

TL;DR: A novel framework for saliency detection is proposed that first models the background with stacked denoising autoencoders built on deep learning architectures, and then separates salient objects from the background.
Abstract: Detection of salient objects from images is gaining increasing research interest in recent years as it can substantially facilitate a wide range of content-based multimedia applications. Based on the assumption that foreground salient regions are distinctive within a certain context, most conventional approaches rely on a number of hand-designed features and their distinctiveness is measured using local or global contrast. Although these approaches have been shown to be effective in dealing with simple images, their limited capability may cause difficulties when dealing with more complicated images. This paper proposes a novel framework for saliency detection by first modeling the background and then separating salient objects from the background. We develop stacked denoising autoencoders with deep learning architectures to model the background where latent patterns are explored and more powerful representations of data are learned in an unsupervised and bottom-up manner. Afterward, we formulate the separation of salient objects from the background as a problem of measuring reconstruction residuals of deep autoencoders. Comprehensive evaluations on three benchmark datasets and comparisons with nine state-of-the-art algorithms demonstrate the superiority of the proposed approach.

Summary (4 min read)

Introduction

  • A few recent approaches tried to learn better representations from natural scenes for saliency detection by using independent component analysis (ICA) [8], sparse coding [9, 10], and low-rank matrix recovery [11].
  • To be specific, in [15] and [16] the global contrast is derived in the frequency domain with the hypothesis that salient regions are normally less frequent.
  • They represent the image as a close-loop graph with superpixels as nodes.
  • Fig. 2 illustrates the workflow of the proposed framework.

II. THE PROPOSED APPROACH

  • The authors discuss the proposed method for salient object detection in detail.
  • It includes three subsections, which in turn introduce SDAE, the proposed saliency detection framework, and two useful post-processing steps, respectively.

A. Stacked Denoising Autoencoder (SDAE)

  • Autoencoders are simple learning neural networks which aim to transform inputs into outputs with the least possible amount of distortion for learning latent patterns of the given data.
  • Specifically, it includes an encoding process and a decoding process.
  • Usually, training a DAE is straightforward, where the back-propagation algorithm can be used to compute the gradient of the objective function [26, 27], and the same target activation function can be used in all the layers when training SDAE.

B. Saliency Detection via Deep Reconstruction Residual

  • As the authors mentioned in Section I, local and global contrast-based methods lack the ability to precisely compute the contrast between foreground objects and the background.
  • The authors follow the basic rule of photographic composition and assume that the image boundary is mostly background.
  • Specifically, the authors separately define four boundaries for each image, as shown in the side-specific SDAE training stage of Fig. 2.
  • Finally, the four residual maps are linearly combined to generate the saliency map: $S_R = (R_{top} + R_{bottom} + R_{left} + R_{right})/4$ (12).
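A minimal sketch of this combination step (Eq. 12), assuming the four side-specific residual maps have already been computed at a common resolution; the final min-max normalization is a common convention rather than something the summary specifies:

```python
import numpy as np

def combine_residual_maps(r_top, r_bottom, r_left, r_right):
    """Average the four side-specific reconstruction residual maps
    into a single saliency map S_R, as in Eq. (12)."""
    s_r = (r_top + r_bottom + r_left + r_right) / 4.0
    # Rescale to [0, 1] for thresholding/visualization (assumed step).
    return (s_r - s_r.min()) / (s_r.max() - s_r.min() + 1e-12)

# Toy usage with random stand-in residual maps.
maps = [np.random.rand(60, 80) for _ in range(4)]
saliency = combine_residual_maps(*maps)
```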

C. Post Processing

  • As discussed above, the authors compute the saliency map $S_R$ at five different image scales to account for scale changes in salient objects.
  • To integrate salient regions at different scales, the authors use the average value of the five single-scale saliency maps to generate the multi-scale integrated saliency map $S_R$.
  • To further refine the results, two post-processing steps are adopted on the basis of the image organization priors and the region property, as presented in detail below.

1) Image organization refinement

  • According to the visual organization rules in [33], these cases can be refined by considering the visual contextual effect.
  • In the first component, following the suggestion of [34] that salient pixels tend to group together, as they typically correspond to real objects in the scene, the authors first apply a self-adaptive threshold $t = \mathrm{mean}(S_R)$ to obtain the salient cluster.
  • In the second component, to deal with the case where the highlighted regions omit parts of the real foreground, the authors follow [35] and include the immediate context by weighting the saliency value of each pixel based on its distance to the high-salient pixel locations.
  • To encode immediate context information, the high-salient pixel locations $\Phi = \{S_R > t\}$ are found, and the saliency values at all pixel locations are weighted by their distance to $\Phi$.
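A sketch of these two refinement components: the self-adaptive threshold $t = \mathrm{mean}(S_R)$ and the distance weighting follow the text, while the Gaussian falloff and the value of `sigma` are illustrative assumptions, since the summary only says saliency is weighted by distance to $\Phi$:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def organization_refine(s_r, sigma=10.0):
    """Image-organization refinement: find the salient cluster with a
    self-adaptive threshold, then attenuate saliency with distance to it."""
    t = s_r.mean()                    # self-adaptive threshold t = mean(S_R)
    phi = s_r > t                     # high-salient pixel locations (Phi)
    d = distance_transform_edt(~phi)  # distance of each pixel to nearest Phi pixel
    # Gaussian falloff is an assumed choice of distance weighting.
    return s_r * np.exp(-(d ** 2) / (2 * sigma ** 2))

refined = organization_refine(np.random.rand(60, 80))
```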

2) Region smoothing

  • In order to highlight the entire salient object uniformly and recover more edge information, inspired by [35], the authors refine the saliency of each pixel using the region information.
  • Specifically, a graph based segmentation algorithm [36] is used to decompose the image into a number of small regions and the final saliency of each region is calculated by the average saliency value of all the pixels within it.
  • Examples of region smoothing results are shown in the fifth column of Fig. 4.
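A sketch of the region-smoothing step, assuming a precomputed integer label map from the graph-based segmentation of [36] (not reimplemented here):

```python
import numpy as np

def region_smooth(s_r, labels):
    """Assign every pixel the mean saliency of its segment, so whole
    regions are highlighted uniformly."""
    out = np.empty_like(s_r)
    for seg in np.unique(labels):
        mask = labels == seg
        out[mask] = s_r[mask].mean()
    return out

# Toy usage: a saliency map with a two-region label map.
s = np.random.rand(4, 4)
lab = np.zeros((4, 4), dtype=int)
lab[:, 2:] = 1
smoothed = region_smooth(s, lab)
```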

III. EXPERIMENTS

  • To evaluate the performance of the proposed salient object detection framework, the authors compared it with 9 state-of-the-art approaches, all published within the last three years in top journals or conferences.
  • To obtain the performance of these 9 methods, the authors adopted either the author-provided implementations or author-provided saliency maps.
  • To the best of their knowledge, this dataset is one of the largest test sets for salient object detection whose ground truth is in the form of manually labeled accurate object contours instead of rough bounding boxes.
  • It can be observed that, compared with PD, GBMR, GS-S, GS-G, BLSM, and CNTX, the proposed method can highlight salient regions more uniformly.

A. Evaluation Metrics

  • Following previous works [9, 12, 15, 16, 34, 41-43], four metrics are adopted in the experiments to quantitatively measure the performance of the saliency maps: the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), the precision-recall (PR) curve, and the average precision (AP).
  • Observing the Gaussian-like distributions of the saliency values in the proposed saliency maps, an adaptive threshold $T = \mu + \sigma$, as suggested in [44], is used to segment the saliency maps.
  • For each segmented foreground binary map $SF_T$ under the adaptive threshold $T$, the authors follow [51] and evaluate it using the weighted F-measure.
  • To take into consideration both the dependency between pixels and the location of the errors, a weighting function is applied to the errors as $E^w = \min(E, E\mathbf{A}) \cdot \mathbf{B}$.
  • Then, the weighted true positives $TP^w$, weighted false positives $FP^w$, and weighted false negatives $FN^w$ can be calculated accordingly.
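A minimal sketch of the adaptive thresholding step $T = \mu + \sigma$ from [44]; the weighted F-measure machinery of [51] (the matrices $\mathbf{A}$ and $\mathbf{B}$) is not reproduced here, only the segmentation that feeds it:

```python
import numpy as np

def adaptive_segment(s_r):
    """Binarize a saliency map with the adaptive threshold T = mu + sigma,
    yielding the foreground binary map SF_T."""
    t = s_r.mean() + s_r.std()
    return (s_r >= t).astype(np.uint8)

binary = adaptive_segment(np.random.rand(60, 80))
```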

B. Parameters Analysis and Model Evaluation

  • The authors analyze the effect of a few key parameters in the proposed model on performance.
  • Here the authors conducted the evaluation on the SOD and SED datasets.
  • Some examples of the experimental results obtained under different $\beta$ are also given in Fig. 7.
  • From the second and the third columns of Fig. 7, it can be seen that for images with cluttered backgrounds, sparsity is an essential element for suppressing the saliency of the background regions.
  • A similar phenomenon is also observed in [48, 49].

C. Evaluations on the ASD Dataset

  • The authors conducted quantitative comparisons with state-of-the-art methods on the ASD dataset using ROC, PR, AUC, AP, and the weighted performance metric.
  • The ROC results show that the proposed method achieves the highest true positive rate when the false positive rate is between about 0.05 and 1; as a result, the proposed method outperforms the other 9 algorithms in terms of ROC and AUC.
  • The statistical results reflect the distributions of the true salient pixels and true background pixels over the calculated saliency values.

D. Evaluations on the SOD Dataset

  • The authors also conducted comparisons with state-of-the-art methods on the more challenging SOD dataset.
  • All the comparison results, including ROC, AUC, PR, AP, and weighted F-measure, are shown in Figs. 13-15.
  • From Fig. 15, it is observed that the proposed approach achieves the highest weighted F-measure; the figure also shows that the weighted recall values of most of the state-of-the-art methods are less than 0.6, whereas the proposed approach achieves the highest weighted recall value, around 0.64, which indicates that the proposed method tends to highlight the entire salient objects.
  • For the foreground distribution and background distribution, similar observations can be found in comparison of results obtained from different approaches.
  • As shown in Fig. 16, the distributions on the SOD dataset degrade noticeably.

E. Evaluations on the SED dataset

  • The proposed approach was also tested on the SED database, another challenging dataset.
  • As GS-S and GS provided their codes and results on this dataset, the authors were able to include them in the comparison.
  • More encouragingly, compared with other state-of-the-art methods, the proposed method achieves the higher true positive rates along the whole ROC curve, and the higher precision values along almost the whole PR curve as well.
  • Similar to the SOD dataset, the SED dataset also contains a large number of images with complicated content and multiple salient objects.
  • The experimental results show that the proposed algorithm has a more powerful capability to handle these challenges.

F. Running time

  • Table II lists the average execution time in processing an image of size 400×300 by using different approaches.
  • For the implementation of the proposed method, the authors used the parallel computing toolbox of MATLAB and executed the code on an NVIDIA GeForce GTX Titan Black GPU.
  • For other state-of-the-art approaches, the authors used the source codes provided by their authors.
  • The authors did not compare with GS because the corresponding codes have not been released by the authors.
  • As can be seen, the proposed algorithm has moderate computational complexity.

IV. CONCLUSION

  • The authors have proposed a bottom-up salient object detection framework based on the background prior.
  • The superiority of the proposed algorithm over existing approaches is twofold.
  • First, instead of using traditional hand-designed features, the proposed algorithm adopted SDAE with deep structures to learn more powerful representations for saliency computation.
  • Second, the proposed work cast the separation of salient objects from the background as a problem of calculating the reconstruction residual of the SDAE.
  • For future work, the authors intend to extend the proposed work in the following directions.


Index Terms—salient object detection, stacked denoising
autoencoder, background prior, deep reconstruction residual.
I. INTRODUCTION

Salient object detection, aiming to discover the most important and informative parts in an image, is gaining
intensive research attention recently as it can serve as a base for
a large number of multimedia applications such as image
resizing, image montage, action analysis and visual recognition
[1-4]. Based on the underlying hypothesis that the salient
stimulus is distinct from its contextual stimuli, most existing
saliency detection models need to solve two key problems: i)
extract effective features to represent the image and, ii) develop
an optimal mechanism to measure the distinctiveness over the
extracted features.
The performance of saliency detection models heavily relies on the features (data representations) being used.

[Footnote: Manuscript received April 14, 2014. This work was partially supported by the National Science Foundation of China under Grants 61103061, 91120005, and 61473231. Junwei Han, Dingwen Zhang, Xintao Hu, and Lei Guo are with the School of Automation, Northwestern Polytechnical University, Xi'an, China (phone and fax: 86-29-88431318; e-mail: junweihan2010@gmail.com). Jinchang Ren is with the Department of Electronic and Electrical Engineering, University of Strathclyde, UK. Feng Wu is with the School of Information Science, University of Science and Technology of China.]

In the last 15
years, a variety of features have been proposed for the task of
image saliency detection. The earliest saliency computation
model by Itti et al. [5] proposed three biologically plausible
features including color, intensity, and orientation. In Judd et al.
[6], besides Itti's three features, several new features were
introduced to characterize image content, which include the
local energy of the steerable pyramid filters, subband pyramids
based features, 3D color histogram, and horizon line detector.
As visual attention could be directed by specific objects, some
detectors of face, car, and person were treated as features for
detecting saliency [6, 7]. All these feature representations are
hand-designed and require significant amounts of domain
knowledge. However, hand-designed features in general suffer from
poor generalization capability for different images, especially
due to the lack of thorough understanding of the biological
mechanisms and principles of human visual attention as well as
weak human intuition involved. A few recent approaches tried
to learn better representations from natural scenes for saliency
detection by using independent component analysis (ICA) [8],
sparse coding [9, 10], and low-rank matrix recovery [11].
Nevertheless, due to the shallow-structured architectures used,
these methods still have limited representational power and are
insufficient to capture high-level information and latent
patterns of complex image data. To overcome such drawbacks,
in this paper, we investigate the feasibility of learning more
powerful representation directly from the raw image data itself
in an unsupervised way for the task of saliency detection.
The saliency or distinctiveness is typically measured by
image contrast computation over features, where various
contrast measures have been presented. Depending on the
extent of context in which the contrast is calculated, these
approaches can be classified into local-contrast based methods
and global-contrast based methods. Local-contrast based
methods estimate the saliency of an image pixel or an image
patch by calculating the contrast against its local neighborhood,
and some representative local methods include the
center-surround difference [5, 6, 12, 13], incremental coding
length [10], and self-resemblance [14]. Global-contrast based
methods characterize the saliency of an image region as the
uniqueness in the entire image. Previous studies have
proposed a variety of approaches to model the global contrast
from different perspectives. To be specific, in [15] and [16] the
global contrast is derived in the frequency domain with the
hypothesis that salient regions are normally less frequent. Han
et al. [9] and Zhang et al. [8] utilized the Gaussian models to
calculate the global contrast. Cheng et al. [17] proposed to
model the global contrast on the region level where each
region's contrast is generated by a weighted summation of the
differences between itself and all other regions. Shen et al. [11]
represented a whole image as a low-rank matrix with sparse
noises where sparse noises denote the salient regions.
In spite of extensive efforts, local and global contrast based
approaches still suffer from some drawbacks. First, these
approaches normally can only highlight object boundaries but
fail to detect the whole target region uniformly as shown in the
examples given in Fig. 1. This problem may be alleviated in
some global-contrast based methods while the results yielded
are still unsatisfactory. Second, although salient objects often present high contrast, the inverse might not necessarily be true [11]. In many complex images (as shown in the third example of Fig. 1), the background contains small-scale high-contrast patterns which may cause previous contrast-based methods to fail.
Essentially, the true aim of salient object detection is to find
objects that are distinctive from the image background. It needs
to calculate the contrast between the objects and the image
background and then select those with high contrast as the
salient objects. However, the local and global contrast-based
methods do not identify which regions form the image
background. They blindly assume the neighboring regions or
the entire image to be the background and then calculate the
contrast between each location and the assumed background.
As their assumed background may not be the real one, the
determined contrast also becomes incorrect, which in turn
reduces the performance of saliency detection. To overcome
these problems, a few emerging methods [18, 19] using
background priors were proposed based on the idea of
modeling the property of background first and thereby
separating salient objects from the background. Specifically, Wei
et al. [18] exploited the boundary and connectivity priors about
the background in natural images and detected saliency based
on the geodesic distance. Considering that the salient object
may be partially cropped on the boundary, this work adopts an
existing saliency detection method [33] to compute the saliency
of boundary patches and generates weights for the virtual
background nodes. However, in some challenging images
where the work [33] could not calculate the saliency of
boundary patches precisely, the method of [18] can hardly obtain satisfactory results. Yang et al. [19] modeled saliency
detection as a manifold ranking problem and proposed a
two-stage scheme for graph labelling. They represent the image
as a close-loop graph with superpixels as nodes. In saliency
detection, they first use the nodes on the image boundary as
background seeds to rank other nodes in the graph. Then, in the
second stage, they select the salient nodes from the detection
results of the first stage and use them to refine the saliency of
other nodes in the graph. On the assumption that the image
boundary is mostly background, these methods result in a
background template. As a result, the contrast between salient
object and background can be precisely obtained. By
incorporating background priors into traditional contrast-based
methods, they show improved results in saliency detection.
However, existing background prior based methods still
have certain limitations. Typically, there are four scenarios encountered when performing background prior based saliency detection, as summarized below.
1) The entire image boundary is a large and smoothly
connected region (see the first row of Fig. 1);
2) The regions defined within the image boundary look
different whereas they may share certain latent pattern (see the
second row of Fig. 1);
3) The background is complex (for example, containing
small-scale high-contrast patterns) and regions of image
boundary are different as shown in the third row of Fig. 1;
4) Salient objects significantly touch the image boundary and
parts of them are wrongly considered as background as shown
in the fourth row of Fig. 1.
As can be seen in Fig. 1, existing background prior based
approaches [18] are effective for the first scenario and
moderately effective for the second scenario. However,
unsatisfactory results are produced in dealing with the last two
scenarios. In this paper, we propose a novel background prior
based saliency detection framework using stacked denoising
autoencoder (SDAE) with deep learning architectures. In the
proposed work, SDAE is used to model image background.
Rather than adopting hand-designed features as used in
previous works [18, 19], the deep-structured SDAE is
employed to learn more powerful representation directly from
the raw image data in an unsupervised way, which also enables it to capture the latent pattern of the input data hierarchically. It
thus helps to deal with the second scenario (shown in the
second row of Fig. 1) where the background regions share
latent patterns. Then, the measure of contrast between salient
objects and the background is formulated as the reconstruction
residuals in the deep-structured SDAE. Different from the
previous works [18, 19] which mainly focused on the way to
calculate the similarity or distinctiveness between a certain
image patch and the image boundary, the proposed work pays
more attention to modeling the background regions.
Fig. 1. Some examples of saliency detection. (a) Input images. (b) Results
from one local contrast method [5]. (c) Results from one global contrast
method [15]. (d) Results from the background prior based method [18]. (e)
Results from the proposed method. (f) Ground truth salient object masks.

Specifically, the sparsity is considered when training SDAE
models, which is helpful to suppress the saliency of the
background regions. Therefore, it is robust in handling the third
scenario (shown in the third row of Fig. 1) where the most
challenging task is to avoid mis-highlighting the small-scale
high-contrast background regions in the saliency maps. In
addition, the learning process of SDAE with the usage of
stochastic corruption criteria is helpful to train a deep model for
better robustness and feature representation. Thus, the trained
robust SDAE shows promising performance in these scenarios.
Fig. 2 illustrates the workflow of the proposed framework.
First, we downsample the original image to multiple scales to
generate the multi-scale inputs. Afterwards, we explore the
background prior via SDAE and detect salient regions by deep
reconstruction residuals which can reflect the distinctness
between the background and salient regions. Finally, post-processing steps are applied to integrate the salient object detection
results for each scale of input and generate the final saliency
map by image organization refinement and region smoothing.
The rest of the paper is organized as follows. Section II
introduces the proposed approach in detail. Section III
presents experimental results with quantitative evaluation in
comparison with a group of state-of-the-art approaches. Finally,
several concluding remarks are drawn in Section IV.
II. THE PROPOSED APPROACH

In this section, we discuss the proposed method for salient object detection in detail. It includes three subsections, which in turn introduce SDAE, the proposed saliency detection
framework, and two useful post-processing steps, respectively.
A. Stacked Denoising Autoencoder (SDAE)
Autoencoders are simple learning neural networks which aim
to transform inputs into outputs with the least possible amount
of distortion for learning latent patterns of the given data. While
conceptually simple, they play an important role in machine
learning and feature representation. More recently,
autoencoders have taken center stage again in the “deep
architecture” approaches [20-23], where autoencoders are
stacked and pre-trained in an unsupervised fashion. These deep
architectures have been shown to lead to state-of-the-art results
on a number of classification and regression problems [24].
As a form of neural network, the classical autoencoder [24] is
an unsupervised learning algorithm that applies
back-propagation and sets the target values of the network
outputs to be equal to the inputs. Specifically, it includes an
encoding process and a decoding process. The encoding
process uses an encoding function $f_{\theta_f}(\mathbf{x}_i)$ to take a nonlinear mapping from the visible input vector $\mathbf{x}_i$ to a hidden representation vector $\mathbf{y}_i$ by using an affine transformation with a projection matrix $\mathbf{W}$ and a bias $\mathbf{b}$. Normally, the sigmoid function $\mathrm{sigm}(\eta) = 1/(1+\exp(-\eta))$ is used as the deterministic mapping as follows:

$$\mathbf{y}_i = f_{\theta_f}(\mathbf{x}_i) = \mathrm{sigm}(\mathbf{W}\mathbf{x}_i + \mathbf{b}) \qquad (1)$$
A decoding function $g_{\theta_g}(\mathbf{y}_i)$ is adopted to map the hidden representation $\mathbf{y}_i$ back to a reconstruction representation $\mathbf{z}_i$ through a similar transformation:

$$\mathbf{z}_i = g_{\theta_g}(\mathbf{y}_i) = \mathrm{sigm}(\mathbf{W}'\mathbf{y}_i + \mathbf{b}') \qquad (2)$$
After the decoding process, the obtained reconstruction is taken as a prediction of the input $\mathbf{x}_i$. The training of an autoencoder is to optimize the parameters $\theta_f = \{\mathbf{W}, \mathbf{b}\}$ and $\theta_g = \{\mathbf{W}', \mathbf{b}'\}$ by minimizing the mean-squared reconstruction error between the training data and their reconstructed data via:

$$\arg\min_{\theta_f, \theta_g} L(\mathbf{X}, \mathbf{Z}) \qquad (3)$$

$$L(\mathbf{X}, \mathbf{Z}) = \frac{1}{2} \sum_{i=1}^{m} \|\mathbf{x}_i - \mathbf{z}_i\|_2^2 \qquad (4)$$

where $\mathbf{X} = \{\mathbf{x}_i\}$ and $\mathbf{Z} = \{\mathbf{z}_i\}$, $i \in [1, m]$, denote all the training and reconstructed data, respectively.
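To make Eqs. (1)-(4) concrete, here is a minimal numpy sketch of the encode/decode pass and the reconstruction loss; the layer sizes and weight initialization are illustrative, and the back-propagation update is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(eta):
    """Sigmoid nonlinearity sigm(eta) = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + np.exp(-eta))

# Toy dimensions: 108-d inputs (6x6x3 patches, see Section II-B), 64 hidden units.
d_in, d_hid = 108, 64
W  = rng.normal(0.0, 0.01, (d_hid, d_in)); b  = np.zeros(d_hid)   # theta_f
Wp = rng.normal(0.0, 0.01, (d_in, d_hid)); bp = np.zeros(d_in)    # theta_g

def encode(x):   # Eq. (1): y = sigm(W x + b)
    return sigm(W @ x + b)

def decode(y):   # Eq. (2): z = sigm(W' y + b')
    return sigm(Wp @ y + bp)

def recon_loss(X):   # Eq. (4): L = 1/2 * sum_i ||x_i - z_i||^2
    return 0.5 * sum(np.sum((x - decode(encode(x))) ** 2) for x in X)

X = [rng.random(d_in) for _ in range(10)]
print(recon_loss(X))
```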
Stacked autoencoder (SAE) is a deep learning architecture of
the classical autoencoders, which is built by stacking additional
unsupervised feature learning layers, and can be trained using
greedy methods for each additional layer. Specifically, once the
first layer is trained, the hidden representation of the first layer
can be treated as the input of the second layer. As a result, any
number of the K layers in this deep architecture can be trained
effectively. This deep architecture allows SAE to learn more
complex mappings from the input to hidden representations and capture the latent patterns which reflect the most homogeneous property shared among the training data.
Fig. 2. The workflow of the proposed framework.

1051-8215 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCSVT.2014.2381471, IEEE Transactions on Circuits and Systems for Video Technology
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) <
5
The stacked denoising autoencoder (SDAE) [25] is an extension of the SAE. It builds a deep architecture by stacking multiple layers of the denoising autoencoder (DAE), which reconstructs the input from a corrupted and partially destroyed version of it. By introducing stochastic corruption to the training samples, SDAE can avoid over-fitting and learn better features, where non-trivial features are robust to input noise and useful for further tasks. For a two-layered SDAE, this is done by first corrupting the initial input $\mathbf{x}_i \in \mathbf{X}$ into $\tilde{\mathbf{x}}_i$ by using a stochastic mapping $\tilde{\mathbf{x}}_i = q_D(\tilde{\mathbf{x}}_i | \mathbf{x}_i)$. According to [24, 25], $q_D(\tilde{\mathbf{x}}_i | \mathbf{x}_i)$ is implemented by randomly selecting a fraction (10% in this paper) of the input data and forcing them to be zero. In the bottom layer, the corrupted input $\tilde{\mathbf{x}}_i$ is then mapped to a hidden representation $\mathbf{y}_i^1 = f_{\theta_f^1}(\tilde{\mathbf{x}}_i)$ from which we reconstruct $\mathbf{z}_i^1 = g_{\theta_g^1}(\mathbf{y}_i^1)$.

Once the bottom layer is trained, the hidden representation of the bottom layer $\mathbf{y}_i^1$ is henceforth used as the input of the second layer $\mathbf{x}_i^2$ to train a new denoising autoencoder as follows:

$$\tilde{\mathbf{x}}_i^2 = q_D(\tilde{\mathbf{x}}_i^2 | \mathbf{x}_i^2) \qquad (5)$$

$$\mathbf{y}_i^2 = f_{\theta_f^2}(\tilde{\mathbf{x}}_i^2) \qquad (6)$$

$$\mathbf{z}_i^2 = g_{\theta_g^2}(\mathbf{y}_i^2) \qquad (7)$$

Note that SDAE still minimizes the reconstruction loss between a clean input $\mathbf{X}$ and its reconstruction representation $\mathbf{Z}$. It thus forces the learning of a far more clever mapping than the identity, e.g. extracting useful features for denoising [25].
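A sketch of the stochastic corruption $q_D$ described above: a random 10% of the input entries are forced to zero, while the reconstruction loss is still computed against the clean input:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, frac=0.10):
    """Stochastic corruption q_D: zero out a random fraction (10% in the
    paper) of the entries of the input vector."""
    x_tilde = x.copy()
    idx = rng.choice(x.size, size=int(frac * x.size), replace=False)
    x_tilde[idx] = 0.0
    return x_tilde

x = rng.random(108)
x_tilde = corrupt(x)  # fed to the encoder; the loss compares z to the clean x
```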
Motivated by the physiological evidence that describing patterns with fewer active neurons minimizes the probability of destructive cross talk, a regularization term that penalizes a deviation of the expected activation of the hidden units (representation vector) from a fixed (low) level $\rho$ is applied to impose sparsity on the target activation function [26]. Taking a single-layer autoencoder as an example, the target activation function with the sparsity constraint can be written as:

$$\arg\min_{\theta_f, \theta_g} L_{sparsity}(\mathbf{X}, \mathbf{Z}, \hat{\rho}_j, \rho) \qquad (8)$$

$$L_{sparsity}(\mathbf{X}, \mathbf{Z}, \hat{\rho}_j, \rho) = L(\mathbf{X}, \mathbf{Z}) + \beta \sum_{j=1}^{N} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) \qquad (9)$$

$$\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j} \qquad (10)$$

where $\beta$ is the weight of the sparsity penalty, $N$ is the number of features in the weight matrix, $\rho$ is the target average activation of the hidden units, and $\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} [\mathbf{y}_i]_j$ is the average activation of the $j$th hidden unit over the $m$ training data. The Kullback-Leibler divergence $\mathrm{KL}(\cdot \| \cdot)$ provides the sparsity constraint. As in sparse coding, a non-redundant over-complete feature set is learned when $\rho$ is small. Here we set $\rho = 0.05$ as suggested in [26]. Usually,
training a DAE is straightforward, where the back-propagation
algorithm can be used to compute the gradient of the objective
function [26, 27], and the same target activation function can be
used in all the layers when training SDAE. As the labels of the
input data are not needed in the training process above, the
layer-wise training step is actually unsupervised.
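A sketch of the sparsity penalty of Eqs. (9)-(10) for a batch of hidden activations; the value of $\beta$ here is illustrative, since the paper treats it as a tunable weight (its effect is analyzed in Section III-B):

```python
import numpy as np

def kl_sparsity_penalty(Y, rho=0.05, beta=3.0):
    """Sparsity term of Eq. (9): beta * sum_j KL(rho || rho_hat_j), where
    rho_hat_j is the mean activation of hidden unit j over the batch Y (m x N)."""
    rho_hat = np.clip(Y.mean(axis=0), 1e-8, 1 - 1e-8)  # clip for numerical safety
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return beta * kl.sum()

Y = np.random.rand(32, 64)  # 32 samples, 64 hidden activations in (0, 1)
penalty = kl_sparsity_penalty(Y)
```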
B. Saliency Detection via Deep Reconstruction Residual
As we mentioned in Section I, local and global
contrast-based methods lack the ability to precisely compute
the contrast between foreground objects and the background.
Inspired by the success of [18], this paper develops the
framework along the pipeline of modeling the background and
thereby separating salient objects from the background. We
follow the basic rule of photographic composition and assume
that the image boundary is mostly background. Then, the
contrast between salient object and the background can be more
precisely obtained. Specifically, we separately define four
boundaries for each image as shown in side-specific SDAE
training of Fig. 2. The height of the two horizontal boundaries is ten percent of the image height and their width is the image width. Similarly, the width of the two vertical boundaries is ten percent of the image width and their height is the image height. To validate the assumption that the image boundary is mostly background, we computed the percentage of foreground pixels (labeled in the ground truth) within the defined image boundaries in two widely used databases (the SOD database [40] and the SED dataset [50]). The statistics show that, for most images, less than 10% of the pixels in the image boundary are foreground pixels, which demonstrates that our assumption is reasonable. For the small number of foreground patches, the learning process of SDAE can decrease their influence by minimizing the objective function with the reconstruction error term when modeling the background.
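A sketch of the four boundary definitions, under the stated assumption of ten-percent border strips; the function and key names are illustrative:

```python
import numpy as np

def boundary_regions(img, frac=0.10):
    """Cut the four border strips assumed to be mostly background: the
    horizontal strips span the full width at 10% of the height, the
    vertical strips span the full height at 10% of the width."""
    h, w = img.shape[:2]
    dh, dw = max(1, int(frac * h)), max(1, int(frac * w))
    return {
        "top": img[:dh, :],
        "bottom": img[-dh:, :],
        "left": img[:, :dw],
        "right": img[:, -dw:],
    }

sides = boundary_regions(np.random.rand(300, 400, 3))
```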
As shown in Fig. 2, the proposed framework mainly consists
of three components: multi-scale inputs generation, salient
region detection via deep reconstruction residual, and post
processing. According to [28, 29], scale is an important factor
for identifying objects of different sizes. Similar to [28], we use five scales, $\{\frac{1}{2}, \frac{1}{3}, \frac{1}{4}, \frac{1}{5}, \frac{1}{6}\}$ of the original image size, to generate multi-scale inputs. The detection is more sensitive to small objects at the large scales, whereas it is more likely to highlight the inner regions of large objects at the small scales.
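A sketch of the multi-scale input generation; nearest-neighbor sampling stands in for whatever resampler the authors actually used, which the text does not specify:

```python
import numpy as np

def multiscale_inputs(img):
    """Downsample an image to the five scales {1/2, 1/3, 1/4, 1/5, 1/6}
    of the original size via nearest-neighbor index sampling."""
    h, w = img.shape[:2]
    pyramid = []
    for s in (1/2, 1/3, 1/4, 1/5, 1/6):
        rows = np.linspace(0, h - 1, int(h * s)).astype(int)
        cols = np.linspace(0, w - 1, int(w * s)).astype(int)
        pyramid.append(img[np.ix_(rows, cols)])
    return pyramid

scales = multiscale_inputs(np.random.rand(300, 400))
```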
Afterwards, we model the background using the SDAEs described in the last subsection and then detect saliency by deep reconstruction residuals at each scale. Specifically, we construct four deep residual maps based on the four boundaries (side-specific deep reconstruction residual maps shown in Fig. 2) and integrate them for the final map, which is referred to as the separation/combination (SC) approach [19]. Specifically, each image boundary is divided into patches of 6×6 pixels with an overlap of 2 pixels in each direction. Afterwards, we establish the SDAE model with a visible (input) layer of 6×6×3 = 108 visible units and two hidden layers. According to

Citations
Journal ArticleDOI
TL;DR: A large-scale data set, termed “NWPU-RESISC45,” is proposed, which is a publicly available benchmark for REmote Sensing Image Scene Classification (RESISC), created by Northwestern Polytechnical University (NWPU).
Abstract: Remote sensing image scene classification plays an important role in a wide range of applications and hence has been receiving remarkable attention. During the past years, significant efforts have been made to develop various datasets or present a variety of approaches for scene classification from remote sensing images. However, a systematic review of the literature concerning datasets and methods for scene classification is still lacking. In addition, almost all existing datasets have a number of limitations, including the small scale of scene classes and the image numbers, the lack of image variations and diversity, and the saturation of accuracy. These limitations severely limit the development of new approaches especially deep learning-based methods. This paper first provides a comprehensive review of the recent progress. Then, we propose a large-scale dataset, termed "NWPU-RESISC45", which is a publicly available benchmark for REmote Sensing Image Scene Classification (RESISC), created by Northwestern Polytechnical University (NWPU). This dataset contains 31,500 images, covering 45 scene classes with 700 images in each class. The proposed NWPU-RESISC45 (i) is large-scale on the scene classes and the total image number, (ii) holds big variations in translation, spatial resolution, viewpoint, object pose, illumination, background, and occlusion, and (iii) has high within-class diversity and between-class similarity. The creation of this dataset will enable the community to develop and evaluate various data-driven algorithms. Finally, several representative methods are evaluated using the proposed dataset and the results are reported as a useful baseline for future research.

1,424 citations

Journal ArticleDOI
TL;DR: This paper proposes a novel and effective approach to learn a rotation-invariant CNN (RICNN) model for advancing the performance of object detection, which is achieved by introducing and learning a new rotation- Invariant layer on the basis of the existing CNN architectures.
Abstract: Object detection in very high resolution optical remote sensing images is a fundamental problem faced for remote sensing image analysis. Due to the advances of powerful feature representations, machine-learning-based object detection is receiving increasing attention. Although numerous feature representations exist, most of them are handcrafted or shallow-learning-based features. As the object detection task becomes more challenging, their description capability becomes limited or even impoverished. More recently, deep learning algorithms, especially convolutional neural networks (CNNs), have shown their much stronger feature representation power in computer vision. Despite the progress made in nature scene images, it is problematic to directly use the CNN feature for object detection in optical remote sensing images because it is difficult to effectively deal with the problem of object rotation variations. To address this problem, this paper proposes a novel and effective approach to learn a rotation-invariant CNN (RICNN) model for advancing the performance of object detection, which is achieved by introducing and learning a new rotation-invariant layer on the basis of the existing CNN architectures. However, different from the training of traditional CNN models that only optimizes the multinomial logistic regression objective, our RICNN model is trained by optimizing a new objective function via imposing a regularization constraint, which explicitly enforces the feature representations of the training samples before and after rotating to be mapped close to each other, hence achieving rotation invariance. To facilitate training, we first train the rotation-invariant layer and then domain-specifically fine-tune the whole RICNN network to further boost the performance. Comprehensive evaluations on a publicly available ten-class object detection data set demonstrate the effectiveness of the proposed method.

1,370 citations

Proceedings ArticleDOI
27 Jun 2016
TL;DR: Evaluations on four benchmark datasets and comparisons with other 11 state-of-the-art algorithms demonstrate that DHSNet not only shows its significant superiority in terms of performance, but also achieves a real-time speed of 23 FPS on modern GPUs.
Abstract: Traditional1 salient object detection models often use hand-crafted features to formulate contrast and various prior knowledge, and then combine them artificially. In this work, we propose a novel end-to-end deep hierarchical saliency network (DHSNet) based on convolutional neural networks for detecting salient objects. DHSNet first makes a coarse global prediction by automatically learning various global structured saliency cues, including global contrast, objectness, compactness, and their optimal combination. Then a novel hierarchical recurrent convolutional neural network (HRCNN) is adopted to further hierarchically and progressively refine the details of saliency maps step by step via integrating local context information. The whole architecture works in a global to local and coarse to fine manner. DHSNet is directly trained using whole images and corresponding ground truth saliency masks. When testing, saliency maps can be generated by directly and efficiently feedforwarding testing images through the network, without relying on any other techniques. Evaluations on four benchmark datasets and comparisons with other 11 state-of-the-art algorithms demonstrate that DHSNet not only shows its significant superiority in terms of performance, but also achieves a real-time speed of 23 FPS on modern GPUs.

770 citations


Cites background or methods from "Background Prior-Based Salient Obje..."

  • ...Background prior [9-11] hypothesizes that regions near image boundaries are probably backgrounds....


  • ..., superpixels used in [9-11, 16-19] and object proposals used in [14]) either as the basic computational units to predict saliency or as the post-processing methods to smooth saliency maps....



Journal ArticleDOI
TL;DR: The underlying relationship among OD, SOD, and COD is revealed and some open questions are discussed as well as several unsolved challenges and promising future works are pointed out.
Abstract: Object detection, including objectness detection (OD), salient object detection (SOD), and category-specific object detection (COD), is one of the most fundamental yet challenging problems in the computer vision community. Over the last several decades, great efforts have been made by researchers to tackle this problem, due to its broad range of applications for other computer vision tasks such as activity or event recognition, content-based image retrieval and scene understanding, etc. While numerous methods have been presented in recent years, a comprehensive review for the proposed high-quality object detection techniques, especially for those based on advanced deep-learning techniques, is still lacking. To this end, this article delves into the recent progress in this research field, including 1) definitions, motivations, and tasks of each subdirection; 2) modern techniques and essential research trends; 3) benchmark data sets and evaluation metrics; and 4) comparisons and analysis of the experimental results. More importantly, we will reveal the underlying relationship among OD, SOD, and COD and discuss in detail some open questions as well as point out several unsolved challenges and promising future works.

564 citations


Cites background from "Background Prior-Based Salient Obje..."

  • ...One of the earliest pioneering works is [45], where Han et al....


References
Journal ArticleDOI
28 Jul 2006-Science
TL;DR: In this article, an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data is described.
Abstract: High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

16,717 citations

Journal ArticleDOI
TL;DR: Recent work in the area of unsupervised feature learning and deep learning is reviewed, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks.
Abstract: The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.

11,201 citations


"Background Prior-Based Salient Obje..." refers background in this paper

  • ...These deep architectures have been shown to lead to state-of-the-art results on a number of classification and regression problems [24]....


  • ...As a form of neural network, the classical autoencoder [24] is an unsupervised learning algorithm that applies backpropagation and sets the target values of the network outputs to be equal to the inputs....


  • ...According to [24] and [25], x̃i = q D(x̃i|xi) is implemented by randomly selecting a fraction (10% in this paper) of the input data and forcing them to be zero....


Journal ArticleDOI
TL;DR: In this article, a visual attention system inspired by the behavior and the neuronal architecture of the early primate visual system is presented, where multiscale image features are combined into a single topographical saliency map.
Abstract: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented. Multiscale image features are combined into a single topographical saliency map. A dynamical neural network then selects attended locations in order of decreasing saliency. The system breaks down the complex problem of scene understanding by rapidly selecting, in a computationally efficient manner, conspicuous locations to be analyzed in detail.

10,525 citations



"Background Prior-Based Salient Obje..." refers background or methods in this paper

  • ...(b) Results from one local contrast method [5]....


  • ...the center-surround difference [5], [6], [12], [13], incremental coding length [10], and self-resemblance [14]....


  • ...[5] proposed three biological plausible features including color, intensity, and orientation....


Journal ArticleDOI
TL;DR: An efficient segmentation algorithm is developed based on a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image and it is shown that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties.
Abstract: This paper addresses the problem of segmenting an image into regions. We define a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image. We then develop an efficient segmentation algorithm based on this predicate, and show that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties. We apply the algorithm to image segmentation using two different kinds of local neighborhoods in constructing the graph, and illustrate the results with both real and synthetic images. The algorithm runs in time nearly linear in the number of graph edges and is also fast in practice. An important characteristic of the method is its ability to preserve detail in low-variability image regions while ignoring detail in high-variability regions.

5,791 citations


"Background Prior-Based Salient Obje..." refers methods in this paper

  • ...Specifically, a graph-based segmentation algorithm [36] is used to decompose the image into a number...


Frequently Asked Questions (15)
Q1. What are the contributions in this paper?

This paper proposes a novel framework for saliency detection by first modeling the background and then separating salient objects from the background. The authors develop stacked denoising autoencoders with deep learning architectures to model the background, where latent patterns are explored and more powerful representations of data are learned in an unsupervised and bottom-up manner.

For future work, the authors intend to extend the proposed work in the following directions; among them, the proposed method can be extended to saliency detection in dynamic videos and many other applications such as image retrieval, image categorization, and image collection visualization.

As a form of neural network, the classical autoencoder [24] is an unsupervised learning algorithm that applies back-propagation and sets the target values of the network outputs to be equal to the inputs. 

Training a DAE is straightforward: the back-propagation algorithm can be used to compute the gradient of the objective function [26, 27], and the same target activation function can be used in all the layers when training SDAE.

In order to take into consideration both the dependency between pixels and the location of the errors, a weighting function is applied to the errors as $E^w = \min(E, E\mathbf{A}) \cdot \mathbf{B}$.

After normalization, the deep reconstruction residual maps $R_{top}$, $R_{bottom}$, $R_{left}$, and $R_{right}$ are obtained based on the SDAE models for the top, bottom, left, and right image boundary subsets, respectively.

For the small number of foreground patches, the learning process of SDAE could decrease their influence by minimizing the objective function with the reconstruction error term when modeling the background. 

Autoencoders are simple learning neural networks which aim to transform inputs into outputs with the least possible amount of distortion for learning latent patterns of the given data. 

The proposed method can be extended to saliency detection in dynamic videos and many other applications such as image retrieval, image categorization, and image collection visualization.

The proposed work cast the separation of salient objects from the background as a problem of calculating the reconstruction residual of the SDAE.

If the sparsity constraint is set too large, it normally leads to less stable and discontinuous detection results (as shown in the fourth column of Fig. 7).

To the best of their knowledge, this dataset is one of the largest test sets for salient object detection whose ground truth is in the form of manually labeled accurate object contours instead of rough bounding boxes.

The subjective evaluations by comparing with the ground truth suggest that the proposed method can yield saliency maps correctly and robustly in all three datasets. 

From Fig. 15, it is observed that the proposed approach achieves the highest weighted F-measure; the figure also shows that the weighted recall values of most of the state-of-the-art methods are less than 0.6, whereas the proposed approach achieves the highest weighted recall value, around 0.64, which indicates that the proposed method tends to highlight the entire salient objects.

As defined in [51], the matrix $\mathbf{A}$ captures the dependency between foreground pixels based on the Euclidean distance and the matrix $\mathbf{B}$ assigns importance weights to false detections according to their distance from the foreground.