
Analysis of Scores, Datasets, and Models in Visual Saliency Prediction

TL;DR: A critical and quantitative look at challenges in saliency modeling, and the way they affect model accuracy, is pursued, providing a comprehensive high-level picture of the strengths and weaknesses of many popular models and suggesting future research directions in saliency modeling.
Abstract: Significant recent progress has been made in developing high-quality saliency models. However, less effort has been undertaken on fair assessment of these models, over large standardized datasets and correctly addressing confounding factors. In this study, we pursue a critical and quantitative look at challenges (e.g., center-bias, map smoothing) in saliency modeling and the way they affect model accuracy. We quantitatively compare 32 state-of-the-art models (using the shuffled AUC score to discount center-bias) on 4 benchmark eye movement datasets, for prediction of human fixation locations and scan path sequence. We also account for the role of map smoothing. We find that, although model rankings vary, some (e.g., AWS, LG, AIM, and HouNIPS) consistently outperform other models over all datasets. Some models work well for prediction of both fixation locations and scan path sequence (e.g., Judd, GBVS). Our results show low prediction accuracy for models over emotional stimuli from the NUSEF dataset. Our last benchmark, for the first time, gauges the ability of models to decode the stimulus category from statistics of fixations, saccades, and model saliency values at fixated locations. In this test, ITTI and AIM models win over other models. Our benchmark provides a comprehensive high-level picture of the strengths and weaknesses of many popular models, and suggests future research directions in saliency modeling.

Summary (2 min read)

1. Introduction

  • A large number of models have been proposed for predicting where people look in scenes [1].
  • It also benefits many engineering applications (e.g., object detection and segmentation, content-aware image re-targeting, image in-painting, visual tracking, image and video compression, crowd analysis and social gaming [2][6][24][37][30], determining the importance of objects in a scene [48, 44], memorability of image regions [49], and object recall [50]).
  • While being very effective, previous comparisons have not correctly addressed all challenging parameters in model accuracy.
  • Here, the authors thoroughly investigate these shortcomings with additional comparison of models over scanpath sequences.
  • The authors provide the latest update on saliency modeling, with the most comprehensive set of models, challenges/parameters, datasets, and measures.

2. Basic concepts and definitions

  • Here, the authors lay out the ground for the rest of the paper and explain some basic concepts of visual attention.
  • Saliency is a property of the perceived visual stimulus (bottom-up (BU)) or, at most, of the features that the visual system extracts from the stimulus (which can be manipulated by top-down (TD) cues).
  • A major distinction between mechanisms of visual attention is the bottom-up vs. top-down dissociation.
  • Note that these are not exclusive concepts.

3. Analysis of challenges and open problems

  • The authors discuss challenges that have emerged as more models have been proposed.
  • Some models have added center-bias (a location prior), either explicitly (e.g., Judd) or implicitly (e.g., GBVS), making fair comparison challenging.
  • Two other issues regarding scores are sensitivity to map normalization (a.k.a. re-parameterization) and having well-defined bounds (and chance level).
  • When using the method in [25] (i.e., saliency from other images but at fixations of the current image), this type of AUC leads to the exact value of 0.5 for the central Gaussian (See Supp.).

4. Saliency benchmark

  • The authors resized saliency maps to the size of the original images onto which eye movements have been recorded.
  • Over the largest dataset (i.e., MIT), AWS, LG, AIM, and Torralba performed better than other models.
  • Fig. 4 shows model performance over stimulus categories of the NUSEF dataset for each model and average over all models.
  • Algorithm 1 (Scanpath evaluation), Phase 1: generating human scanpaths and clusters; input: subjects’ fixations (a hedged sketch follows this list).
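A heavily hedged sketch of what such a "Phase 1" could look like (this is our reconstruction, not the paper's Algorithm 1): pool all subjects' fixations, cluster them, and re-express each subject's scanpath as a string of cluster labels. The clustering method, bandwidth, and library choice (scikit-learn) are our assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift

def scanpath_strings(fixations_by_subject, bandwidth=50):
    """Cluster pooled fixations and encode each subject's scanpath as a label string."""
    all_fix = np.vstack(list(fixations_by_subject.values()))
    ms = MeanShift(bandwidth=bandwidth).fit(all_fix)   # bandwidth is an assumed value
    strings = {}
    for subject, fix in fixations_by_subject.items():
        labels = ms.predict(fix)                       # cluster index per fixation
        strings[subject] = "".join(chr(ord("A") + int(l)) for l in labels)
    return strings, ms.cluster_centers_
```

The resulting strings could then be compared with standard sequence-alignment techniques, but the exact comparison used in the paper is described in its supplement, not here.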

5. Model comparison over applications

  • Since models are already good at predicting fixations, some new measures are necessary to draw distinctions among them.
  • Later, the histogram of a saccade statistic for an image is computed from all observers and L1-normalized (a minimal sketch follows this list).
  • Fig. 10 shows performance of individual features in classification.
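Based on the details given in this summary and in the FAQ below (a 16 × 16 fixation-count grid, L1-normalized histograms of quantized saccade statistics pooled over observers), a minimal sketch of such per-image features might look as follows; the function names, bin count, and normalization of the fixation grid are our assumptions, not the authors' code.

```python
import numpy as np

def fixation_histogram(fix, shape, grid=16):
    """16 x 16 histogram of fixation counts over the image, L1-normalized."""
    h, w = shape
    rows = np.clip(fix[:, 0].astype(int) * grid // h, 0, grid - 1)
    cols = np.clip(fix[:, 1].astype(int) * grid // w, 0, grid - 1)
    hist = np.zeros((grid, grid))
    np.add.at(hist, (rows, cols), 1)
    return hist.ravel() / max(hist.sum(), 1)

def saccade_histogram(values, bins=10, value_range=None):
    """L1-normalized histogram of a saccade statistic (e.g., amplitude or velocity)."""
    hist, _ = np.histogram(values, bins=bins, range=value_range)
    return hist / max(hist.sum(), 1)
```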

6. Conclusions and future directions

  • The authors’ comparisons show that, in general, the AWS, LG, HouNIPS, Judd, Rarity-G (smoothed version), AIM, and Torralba models performed better than other models.
  • Thus the authors believe it is important to gather larger datasets, especially over new stimulus categories.
  • The authors showed that, from statistics of fixations, saccades, and saliency at fixations, it is possible to decode the stimulus category.
  • Another promising research direction is designing better saliency evaluation scores which: (1) are able to better distinguish fixated vs. non-fixated locations, and (2) are able to discount confounding parameters such as center-bias.


Analysis of scores, datasets, and models in visual saliency prediction
Ali Borji, Hamed R. Tavakoli+, Dicky N. Sihite, Laurent Itti
Department of Computer Science, University of Southern California, Los Angeles
+Center for Machine Vision Research, University of Oulu, Finland
1. Introduction
A large number of models have been proposed for predicting where people look in scenes [1]. But due to the lack of an exhaustive, coherent benchmarking system that addresses issues such as evaluation measures (e.g., at least 4 types of AUC measures have been used; see supplement), center-bias, map characteristics (e.g., smoothing), and dataset bias, many inconsistencies still exist. The discrepancy of results in previous works calls for a unified approach for gauging the progress in this field and for fair comparison of models.

Importance. Modeling visual saliency broadens our understanding of a highly complex cognitive behavior, which may lead to subsequent findings in other areas (object and scene recognition, visual search, etc.) [27][5]. It also benefits many engineering applications (e.g., object detection and segmentation, content-aware image re-targeting, image in-painting, visual tracking, image and video compression, crowd analysis and social gaming [2][6][24][37][30], determining the importance of objects in a scene [48, 44], memorability of image regions [49], and object recall [50]).

Our contributions. We offer 3 main contributions: (1) discussing current challenges and directions in saliency modeling (evaluation metrics, dataset bias, model parameters, etc.) and proposing solutions, (2) comparing 32 models and their pros and cons in a unified quantitative framework over 4 widely-used datasets for fixation prediction (on classic and emotional stimuli) as well as scanpath prediction, and (3) stimulus/task decoding using saliency and fixation statistics. Hopefully, our study will open new directions and conversations and help better organize the saliency literature.

Previous benchmarks. Few attempts have been made at saliency model comparison, but their shortcomings have driven us to conduct a new benchmark considering the latest progress. Borji et al. [30] compared 36 models over 3 datasets. Judd et al. [41] compared 9 models over only 300 images. Some works have compared salient object detection and region-of-interest algorithms [42]. Some other benchmarks have compared models over applications such as image quality assessment [52]. While being very effective, previous comparisons have not correctly addressed all challenging parameters in model accuracy. For example, map smoothing, which influences the fixation prediction accuracy of models [43], or center-bias (the tendency of humans to look towards the center of an image, such that a trivial model that predicts salience near the center may surpass other saliency models) [28, 25], have not been addressed in previous benchmarks. Here, we thoroughly investigate these shortcomings with an additional comparison of models over scanpath sequences. We provide the latest update on saliency modeling, with the most comprehensive set of models, challenges/parameters, datasets, and measures.
2. Basic concepts and definitions
Here, we lay out the ground for the rest of the paper and explain some basic concepts of visual attention.

There is often confusion between saliency and attention. Saliency is a property of the perceived visual stimulus (bottom-up (BU)) or, at most, of the features that the visual system extracts from the stimulus (which can be manipulated by top-down (TD) cues). Attention is a much more general concept that depends on many high-level cognitive factors, such as the strategy for image search and interactions between saliency and search strategy, as well as subjective factors such as age and experience.

A major distinction between mechanisms of visual attention is the bottom-up vs. top-down dissociation. Bottom-up attention is reflexive, fast, likely feed-forward, and mainly deployed by stimulus saliency (e.g., pop-out). On the other hand, top-down attention is deliberative, slow, and powerful, with variable selection criteria depending on the task.

Previous studies of attention differ mainly in the type of stimuli they have employed and the tasks they have addressed. Visual stimuli used in neurophysiological and modeling works include static stimuli (synthetic search arrays involving pop-out and conjunction search, cartoons, or photographs) and spatio-temporal dynamic stimuli (movies and interactive video games). These stimuli have been exploited for studying visual attention over three types of tasks: (1) free viewing, (2) visual search, and (3) interactive tasks (games or real-world tasks [54]).

What is the unit of attention? Do we attend to spatial locations, objects, or features? [5][27] A great deal of neurophysiological and behavioral evidence exists for all three. Space-based theories claim that humans deliberately attend to spatial locations where a target may appear. Similar observations indicate that visual attention is essentially guided by recognized objects, with low-level saliency contributing only indirectly [50]. Features are well explored by neural studies showing that neurons accommodate their response properties to render an object of interest more salient. Note that these are not exclusive concepts.

A field closely related to saliency modeling is salient region detection. While the goal of the former is to predict locations that grab attention, the latter attempts to segment the most salient object or region in a scene. Evaluation is often done by measuring precision-recall of a model's saliency maps against ground-truth data (explicit saliency judgments of subjects obtained by annotating salient objects or clicking on locations). Some models in the two categories have compared themselves against each other, without being aware of the distinction and different goals of the models.

The majority of models described by the above concepts have focused on bottom-up, space-based, and static attention for explaining eye movements in free-viewing tasks.
3. Analysis of challenges and open problems
We discuss challenges that have emerged as more models have been proposed. These are open issues not only for research but also for performing fair model comparison.
[Figure 1: scatter plots summarizing parameters of 13 major eye movement datasets (MIT, NUSEF, Toronto, Kootstra, Le Meur, DOVES, FIFA, Kienzle, Einhauser, Engman, Tatler, Engelke, Reinagel): number of images, number of eye-tracking subjects, distance from screen, and presentation time.]
Figure 1. Parameters of 13 major eye movement datasets. The MIT [2] dataset is the largest one with 1003 images; it has a high degree of photographer bias and a small number of eye-tracking subjects. Le Meur [21] has only 27 images but the highest number of eye-tracking subjects (40). The Toronto dataset [10] has 120 images, mainly indoor and in-city scenes. The 5 common object categories in the Judd dataset are humans, faces, text, animals, and cars.
Dataset bias. Available eye movement datasets vary on several parameters, for instance: number of images, number of viewers, viewing time per image, subject's distance from the screen, and stimulus variety [55]. Fig. 1 shows some popular eye tracking datasets, some of which are publicly available. Due to the small size of the Toronto dataset and its small number of subjects, its sole usage is discouraged. Perhaps the best options so far are the NUSEF [4] and Judd [2] datasets. The NUSEF dataset (758 images and 25 subjects) contains a large number of affective stimuli, making it more suitable for studying semantic attentional cues. As Fig. 1 shows, larger fixation datasets with many images and eye-tracking subjects are needed. Because of the specialty of datasets (different optimal weights for features over different datasets [36]), a fair evaluation is to compare models over several datasets, as presented in Sec. 4. Further, it has been shown that fixation density maps from different laboratories differ significantly due to inter-laboratory differences and experimental conditions [56].

A difficult challenge in fixation datasets which has affected fair model comparison is "Center-Bias (CB)", whereby humans often appear to preferentially look near an image's center [28]. Two important causes for CB are: (1) viewing strategy, where subjects start looking from the image center, and (2) a perhaps stronger photographer bias, which is the tendency of photographers to frame interesting objects at the center. Annoyingly, due to CB in the data, a trivial saliency model that just consists of a Gaussian blob at the center of the image often scores higher than almost all saliency models [2]. This can be verified from the average eye fixation maps of 3 popular datasets (see supplement). We observed higher central fixation densities for images with objects at the center compared with those with objects off the center. Another problem that is in essence similar to CB is handling invalid filter responses at image borders ("border effect", e.g., the AIM model [10]; see [25]).

Some models have added center-bias (a location prior), either explicitly (e.g., Judd) or implicitly (e.g., GBVS), making fair comparison challenging.

[Figure 2 panels: CC, NSS, and AUC plotted against the fixation-map smoothing σ1, the central-Gaussian σ2, and the border size; example maps are annotated with (cc 1.0, auc 1.0, nss 5.3), (cc −0.03, auc 0.4, nss −0.2), and (cc −0.01, auc 0.4, nss −0.07) for fixations vs. the mean eye position (MEP) map on a 300 × 300 image.]
Figure 2. Score analysis. 1st column: scores of a saliency map made by placing a variable Gaussian (σ1) at fixated locations; 2nd column: scores of the central Gaussian blob (σ2); 3rd column: scores of the image with variable border size. Results are averaged over 1000 runs with 10 randomly generated fixations from a Gaussian distribution to mimic center-bias [28] in the data, similar to the heatmap in Fig. 3. Image size: 300 × 300.
Three possible remedies are: (1) every model adds a central Gaussian, which introduces the Gaussian size and its weight as two additional parameters; (2) collecting datasets with no CB, which is difficult since, even if we had an approach to uniformly distribute image content, viewing strategy would still exist; and (3) designing suitable metrics, which we consider the most reasonable approach.
Evaluation metrics. Traditionally, saliency models have been evaluated against eye movement datasets. In some cases, accuracy is whether one can predict what changes people will notice, or what they will remember or annotate [1]. We use three popular metrics for saliency evaluation: (1) the Correlation Coefficient (CC) between a model (s) and human (h) saliency map, $CC(s,h) = \frac{\mathrm{cov}(s,h)}{\sigma_s \sigma_h}$; (2) the Normalized Scanpath Saliency (NSS), the average of saliency values at the n fixations in a normalized map, $NSS = \frac{1}{n}\sum_{i=1}^{n} \frac{s(x_h^i, y_h^i) - \mu_s}{\sigma_s}$ [51]; and (3) the Area Under the ROC Curve (AUC), where human fixations are considered as the positive set and some points from the image are uniformly chosen as the negative set. The saliency map is then treated as a binary classifier to separate the positive samples from the negatives; by thresholding this map and plotting the true positive rate vs. the false positive rate, an ROC curve is obtained and the area under it is computed. Please see the supplement for a discussion of subtle variations of the AUC metrics. KL divergence [9] and earth mover's distance (EMD) [36] have also been used for model evaluation. Some studies have evaluated the sequence of fixations in the scanpath [32, 31].
Fig. 2 shows an analysis of how the above scores are affected by the smoothness of the saliency map and possible center bias in the reference data. We generated some random eye fixations (sampled from a Gaussian distribution) and made a saliency map by convolving them with a Gaussian filter with variable sigma σ1. As shown in the 1st column, increasing σ1 reduces all 3 scores. Over AUC, however, the drop is moderate and the range is very small, meaning that as long as the hit rates are high, the AUC is high regardless of the false alarm rate [36]¹. As shown in the 2nd column, we placed a Gaussian at the center of the image and calculated the scores again by varying the σ2 of the central Gaussian as well as the σ1 of the Gaussian convolved with fixations (only for CC, since for NSS and AUC the fixation positions are used). Increasing σ2 raises all 3 scores up to the maximum match between the Gaussian and the MEP map, after which they drop or saturate. CC scores are raised by increasing σ1. The third column in Fig. 2 shows that by increasing the border size, scores reach a maximum and then drop, an effect similar to center-bias. These analyses show that smoothing the saliency maps and the size of the central Gaussian affect scores and should be accounted for in fair model comparison. The NSS score is more sensitive to smoothing. All of these scores suffer from center-bias.
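The following is our own reconstruction of the kind of synthetic experiment behind Fig. 2 (parameter values are illustrative, not the paper's exact settings); it reuses the `nss` and `auc_uniform` sketches above and assumes SciPy for the Gaussian blur.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

H = W = 300
rng = np.random.default_rng(0)
# 10 fixations drawn from a central Gaussian to mimic center-bias in the data
fix = np.clip(rng.normal([H / 2, W / 2], scale=40, size=(10, 2)), 0, H - 1).astype(int)

for sigma1 in (5, 10, 25, 50, 100):
    smap = np.zeros((H, W))
    smap[fix[:, 0], fix[:, 1]] = 1.0
    smap = gaussian_filter(smap, sigma=sigma1)   # increasingly smoothed "model" map
    print(sigma1, round(nss(smap, fix), 2), round(auc_uniform(smap, fix), 2))
```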
Two other issues regarding scores are sensitivity to map normalization (a.k.a. re-parameterization) and having well-defined bounds (and chance level). Some scores are invariant to continuous monotonic nonlinearities (e.g., KL) while others are not (CC, NSS, and AUC). All scores are invariant to saliency map shifting and scaling. Some scores have well-defined bounds (CC and AUC have lower and upper bounds) while some do not (KL and NSS; KL has a lower bound, and NSS has an upper bound and a chance level of 0).

A proper score for tackling CB is the shuffled AUC (sAUC) [25], whose only difference from AUC is that instead of selecting negative points randomly, all fixations over other images are used as the negative set. This score is not affected by σ2 or border size in Fig. 2. The sAUC value for a central Gaussian and for a uniform white map is near 0.5 (i.e., using fixations from other images as the negative set [28]). When using the method in [25] (i.e., saliency from other images but at fixations of the current image), this type of AUC leads to the exact value of 0.5 for the central Gaussian (see Supp.).
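A hedged sketch of the shuffled AUC idea (our reading of the description above, not the reference implementation from [25]): positives are the map's values at the current image's fixations, negatives are the same map sampled at fixations pooled from other images, so a pure central Gaussian lands near 0.5. `other_fix` is assumed to hold at least as many fixation coordinates as `fix`.

```python
import numpy as np

def sauc(smap, fix, other_fix, n_splits=100, rng=None):
    """Shuffled AUC: negatives come from other images' fixation locations."""
    rng = rng or np.random.default_rng(0)
    pos = smap[fix[:, 0], fix[:, 1]]
    aucs = []
    for _ in range(n_splits):
        idx = rng.choice(len(other_fix), size=len(pos), replace=False)
        neg = smap[other_fix[idx, 0], other_fix[idx, 1]]
        thr = np.concatenate(([np.inf], np.unique(np.r_[pos, neg])[::-1]))
        tpr = [(pos >= t).mean() for t in thr]
        fpr = [(neg >= t).mean() for t in thr]
        aucs.append(np.trapz(tpr, fpr))
    return float(np.mean(aucs))
```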
Features for saliency detection. Traditionally, intensity, orientation, and color (in LAB and RGB spaces) have been used for saliency derivation over static images. For dynamic scenes, flicker and motion features have been added. Furthermore, several other low-level features have been used to estimate saliency (size, depth, optical flow, etc.). High-level features (prior knowledge) such as faces [2], people [2], cars [2], symmetry [8], signs, and text [35] have also been incorporated. One challenge is detecting affective (emotional) features and semantic (high-level knowledge) scene properties (e.g., causality, action-influence), which have been suggested to be important in guiding attention (location and fixation duration) [4]. Models usually use all channels for all sorts of stimuli, which makes them highly dependent on the false positive rates of the employed feature detectors (e.g., a face or car detector). Since existing models use linear features, they render highly textured regions more salient. Non-linear features (e.g., the famous egg-in-the-nest or birthday-candle images [25]) have been proposed but not fully implemented.

¹ Note that in [36] and [2], AUC is calculated by thresholding the saliency map and then measuring the hit rate, which is different from what we (and also [25, 28, 43]) do by spreading random points on the image.

[Figure 3 panels: the image, the human fixation map, and the AWS saliency map smoothed with Gaussian kernels of increasing size, with Gauss = 0.01 → AUC = 0.7286, 0.02 → 0.7274, 0.03 → 0.7272, 0.04 → 0.7227, 0.05 → 0.7174.]
Figure 3. A sample saliency map smoothed by convolving with a variable-size Gaussian kernel (for the AWS model over an image of the Toronto dataset).
Parameters. Models often have several design parameters such as the number and type of filters, choice of non-linearities, within-feature and across-scale normalization schemes, smoothing, and center-bias. Properly tuning these parameters is important for fair model comparison and is perhaps best left to the model's developers to optimize.
4. Saliency benchmark
We chose four widely-used datasets for model comparison: Toronto [10], NUSEF [4], MIT [2], and Kootstra [8]. Table 1 shows the 30 models compared here. Additionally, we implemented two simple yet powerful models to serve as baselines: a Gaussian Blob (Gauss) and a Human inter-observer (IO) model. The Gaussian blob is simply a 2D Gaussian shape drawn at the center of the image; it is expected to predict human gaze well if such gaze is strongly clustered around the center. For a given stimulus, the human model outputs a map built by integrating fixations from subjects other than the one under test while they watched that same stimulus. The map is usually smoothed by convolving it with a Gaussian filter. This inter-observer model is expected to provide an upper bound on the prediction accuracy of computational models, to the extent that different humans may be the best predictors of each other. We resized saliency maps to the size of the original images onto which eye movements had been recorded. Please note that, besides the models compared here, other models may exist that might perform well (e.g., [37]) but are not publicly available or easily accessible. We leave them for future investigations.
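A minimal sketch of the two baselines as we understand them from the description above (the Gaussian width `sigma_frac` and the IO blur `sigma` are our own illustrative choices, not values from the paper):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_blob(shape, sigma_frac=0.25):
    """Central 2-D Gaussian baseline; std is a fraction of image width (assumed)."""
    h, w = shape
    y, x = np.mgrid[0:h, 0:w]
    s = sigma_frac * w
    g = np.exp(-((x - w / 2) ** 2 + (y - h / 2) ** 2) / (2 * s ** 2))
    return g / g.max()

def inter_observer_map(shape, fixations_by_subject, test_subject, sigma=20):
    """IO baseline: blurred fixations of all subjects except the one under test."""
    m = np.zeros(shape)
    for subject, fix in fixations_by_subject.items():
        if subject == test_subject:
            continue
        np.add.at(m, (fix[:, 0], fix[:, 1]), 1)
    return gaussian_filter(m, sigma)
```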
We first measure how well a model performs at predicting where people look over static-image eye movement datasets. We report results using the sAUC score as it has several advantages over the others; results over other scores are shown in the supplement. Note, however, that the sAUC score alone is not the only criterion for our conclusions, as it gives more credit to off-center information and favors true positives more. Next, we compare models on their ability to predict the saccade sequence. Our conclusions are based on the premise that if a model is good, it should perform well over all configurations (i.e., scores, datasets, and parameters).
Predicting fixation locations: Model scores and average ranks using sAUC over the four datasets are shown in Table 1. We smoothed the saliency map of each model by convolving it with a Gaussian kernel (Fig. 3). We then plotted the sAUC of each model over a range of standard deviations of the Gaussian kernel, expressed in image width (from 0.01 to 0.13 in steps of 0.01), and took the maximum value over this range for each model. Compared with our rankings on the original maps (supplement), some models now get a better score. Although the ranking order is not the same over the four datasets, some general patterns are noticeable. The Gaussian model is the worst (not significantly better than chance) over all datasets, as expected. There is a significant difference between the models and the IO model; this difference is more profound over the NUSEF and MIT datasets as they contain many stimuli with complex high-level concepts.
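To illustrate the smoothing sweep used for Table 1, here is our sketch (the function name is ours); it reuses the `sauc` sketch from Sec. 3 and SciPy's `gaussian_filter`.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def best_sauc(smap, fix, other_fix, img_width):
    """Blur the map with kernel std = 0.01..0.13 of image width; keep the best sAUC."""
    scores = []
    for frac in np.arange(0.01, 0.14, 0.01):
        blurred = gaussian_filter(smap, sigma=frac * img_width)
        scores.append(sauc(blurred, fix, other_fix))
    return max(scores)
```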
The AWS model is significantly better than all other models, followed by LG, AIM, Global Rarity, Torralba, HouCVPR, HouNIPS, SDSR, and Judd. Over the largest dataset (i.e., MIT), AWS, LG, AIM, and Torralba performed better than other models. Over the NUSEF dataset, the AIM, LG, Torralba, HouCVPR, and HouNIPS models did best. Kootstra, STB, ITTI (due to a different normalization and map sparseness than ITTI98), and Marat ranked at the bottom. Interestingly, the AWS model on the Kootstra dataset performs as well as the human IO. Our analyses show that CC, NSS, and AUC produce very high scores for the Gaussian, almost better than all models, thanks to its center preference (see supplement). Therefore, we do not recommend using them for saliency model comparison. Considering the rankings under the sAUC, CC, and NSS scores, we noticed that models that performed well under sAUC are also ranked at the top using the other scores.
Fig. 4 shows model performance over the stimulus categories of the NUSEF dataset, for each model and averaged over all models. There is no significant difference over different categories of stimuli averaged over all models (inset; see also supplement), although it seems that models perform better over face stimuli and worst over portrait and nude stimuli (this pattern is clearer when considering only the top-performing models).
[Figure 4 plot: shuffled AUC (roughly 0.4–0.65) per model (AIM, AWS, Bian, Entropy, GBVS, G-Rarity, Tavakoli, HouCVPR, HouNIPS, ITTI, Judd, L-Rarity, PQFT, SDSR, SUN, Surprise, Torralba, Variance, STB, Gauss, Human) over the NUSEF categories event, face, nude, other, and portrait, with an inset showing the average over models per category.]
Figure 4. Model performance over categories of the NUSEF dataset. Gauss and Human IO are excluded from the average (i.e., the inset). Number of images — Event: 36, Face: 52, Nude: 20, Other: 181, Portrait: 123.

| Model | Ref. | Year | Code | Category | Toronto | NUSEF | MIT | Kootstra | Avg. rank |
|---|---|---|---|---|---|---|---|---|---|
| Gaussian-Blob | [28] | - | M | O | .50 | .49 | .50 | .50 | - |
| Inter-observer (IO) | - | - | M | O | .73 | .66 | .75 | .62 | - |
| Variance | - | - | C | I | .66 | .62 | .65 | .58 | 4.8 |
| Entropy | [32] | - | C | I | .65 | .61 | .64 | .57 | 5.8 |
| Itti et al. (ITTI98) | [3] | 98 | C | C | .63 | .57 | .62 | .58 | 7.3 |
| Itti et al. (ITTI) | [33] | 00 | C | C | .62 | .56 | .61 | .57 | 8.3 |
| Torralba | [20] | 03 | C | B | .69 | .63 | .67 | .59 | 3 |
| Vocus (Frintrop) | [6] | 05 | C | C | .66 | - | .65 | .60 | 4.7 |
| Surprise (Itti & Baldi) | [9] | 05 | M | B/I | .63 | .59 | .63 | .58 | 6.8 |
| AIM (Bruce & Tsotsos) | [10] | 05 | M | I | .69 | .64 | .68 | .59 | 2.5 |
| Saliency Toolbox (STB) | [24] | 06 | M | C | .62 | .56 | .58 | .57 | 8.8 |
| GBVS (Harel et al.) | [7] | 06 | S | G | .65 | .59 | .64 | .56 | 6.5 |
| Le Meur et al. | [21] | 07 | M | C | .66 | - | .57 | .57 | 8 |
| HouCVPR (Hou & Zhang) | [11] | 07 | M | S | .69 | .63 | .65 | .59 | 3.5 |
| Local Rarity (Mancas) | [13] | 07 | M | I | .65 | .60 | .63 | .58 | 6 |
| Global Rarity (Mancas) | [13] | 08 | M | I | .69 | .62 | .67 | .61 | 2.8 |
| HouNIPS (Hou & Zhang) | [12] | 08 | E | I | .69 | .63 | .65 | .59 | 3.5 |
| Kootstra et al. | [8] | 08 | M | C | .61 | - | .60 | .56 | 9.3 |
| SUN (Zhang et al.) | [25] | 09 | S | B | .67 | .61 | .65 | .56 | 5.3 |
| Marat et al. | [23] | 09 | M | C | .64 | - | .62 | .54 | 8 |
| PQFT (Guo et al.) | [15] | 09 | M | S | .68 | .61 | .66 | .58 | 4.3 |
| Yin Li et al. | [18] | 09 | M | S | .69 | - | .65 | .59 | 4 |
| SDSR (Seo & Milanfar) | [22] | 09 | M | I | .69 | .61 | .65 | .60 | 3.8 |
| Judd et al. | [2] | 09 | M | P | .68 | .61 | .66 | .59 | 4 |
| Bian et al. | [16] | 10 | S | S | .61 | .63 | .61 | .57 | 7 |
| ESaliency (Avraham et al.) | [14] | 10 | M | G | .65 | - | .62 | .56 | 5.8 |
| Yan et al. | [19] | 10 | E | I | .68 | - | .64 | .58 | 5 |
| AWS (Diaz et al.) | [17] | 10 | E | C | .72 | .64 | .69 | .62 | 1 |
| Jia Li et al. | [26] | 10 | M | B | .67 | - | - | .56 | 6.7 |
| Tavakoli et al. | [34] | 11 | 11 | B | .64 | .56 | .65 | - | 6.5 |
| Murray et al. | [47] | 11 | 11 | C | .64 | .57 | .65 | - | 6.7 |
| LG (Borji & Itti) | [38] | 12 | 11 | I | .70 | .63 | .68 | .59 | 2.5 |
| Avg. score over models | - | - | - | - | .66 | .60 | .64 | .58 | - |

Table 1. Compared visual saliency models. Abbreviations: M: Matlab, C: C/C++, E: Executable, S: Sent saliency maps. Note that STB [24] and VOCUS [6] are two implementations of the Itti et al. [3] model. Numbers are maximum shuffled AUC scores of models obtained by optimizing the saliency map smoothness (Fig. 3); see the supplement for the optimal sigma values of the Gaussian kernel where models take their maxima (σ from 0.01 : 0.01 : 0.13 in image width). Model category is one of [30]: Cognitive (C), Bayesian (B), Decision-theoretic (D), Information-theoretic (I), Graphical (G), Spectral-analysis (S), Pattern-classification (P), Others (O). We observe lower performance for the STB model compared with either the ITTI or ITTI98 models over free-viewing datasets; thus, its use in place of the ITTI model (e.g., for model-based behavioral studies) is not encouraged. We employ two different versions of the Itti et al. model, ITTI98 and ITTI, which correspond to different normalization schemes. In ITTI98, each feature map's contribution to the saliency map is weighted by the squared difference between the globally most active location and the average activity of all other local maxima in the feature map [3]. This gives rise to smooth saliency maps, which tend to correlate better with noisy human eye movement data. In the ITTI model [33], the spatial competition for saliency is much stronger, and is implemented in each feature map as 10 rounds of convolution by a large difference-of-Gaussians followed by half-wave rectification. This gives rise to much sparser saliency maps, which are more useful than the ITTI98 maps when trying to decide on the single next location to look at (e.g., in machine vision and robotics applications). The Kootstra dataset is the hardest one for humans (low IO agreement) and for models; the next hardest is the NUSEF dataset. Note that the symmetry model of Kootstra cannot compete with the other models over the Kootstra dataset, although there are many images with symmetric objects in this dataset. Numbers are rounded to their closest value (see the supplement for more precise values). In Borji et al. [30], results are on original images, while here we report optimized results over smoothed images. In our experiments here we used myGauss = fspecial('gaussian', 50, 10), which was then normalized to [0 1]. In principle, the smoothing Gaussian should be about 1/2 of the visual field.
[Figure 5 plot: shuffled AUC (about 0.5–0.6) vs. smoothing parameter (Gaussian size, 0–14) for AIM, SUN, AWS, GBVS, HouCVPR, HouNIPS, ITTI, Judd, and PQFT over the emotional images.]

| Model | max sAUC score |
|---|---|
| AIM | 0.585 |
| SUN | 0.587 |
| AWS | 0.598 |
| GBVS | 0.569 |
| HouCVPR | 0.588 |
| HouNIPS | 0.584 |
| ITTI | 0.531 |
| Judd | 0.543 |
| PQFT | 0.591 |
| Gauss | 0.50 |

Figure 5. sAUC scores over emotional images of the NUSEF dataset. Results in the right-hand table are the maxima over the smoothing range.
The AWS model did best over all categories; HouNIPS, Judd, SDSR, Yan, and AIM also ranked at the top. Faces are often located at the center, while nude and event stimuli are mostly off-center. Humans are more correlated for portrait, event, and nude stimuli. A separate analysis over the Kootstra dataset showed that models have difficulty with saliency detection over nature stimuli, where there are fewer distinctive and salient objects (see supplement). This means that much progress remains to be made in saliency detection over stimuli containing conceptual content (e.g., images containing interacting objects, actions such as grasping, living vs. non-living things, or object regions inside a bigger object, i.e., faces, body parts, etc.).
[Figure 6 panels: sample NUSEF/IAPS images with positive, neutral, and negative emotional valence — Mickey (positive, valence 7.4), Wolf (neutral, 4.21), Elderly woman (negative, 3.26), Watermelon (positive, 7.04), Fire hydrant (neutral, 5.24), Harassment (negative, 3.19); image numbers 4621, 2590, 1302, 7100, 1999, 7325.]
Figure 6. Sample emotional images with positive, negative, and neutral emotional valence from the NUSEF dataset, along with saliency maps of the AWS model. Note that in some cases saliency misses fixations.
Behavioral studies have shown that affective stimuli influence the way we look at images. Humphrey et al. [45] showed that initial fixations were more likely to land on emotional objects than on more visually salient neutral ones. Here we take a closer look at model differences over emotional (affective) stimuli, using 287 images from NUSEF belonging to the IAPS dataset [46]. Fig. 5 shows sAUC scores of 10 models over affective stimuli. These values are smaller than the ones for the (non-affective) NUSEF images shown in Table 1. Our results (using shuffled AUC and with smoothing similar to Table 1; see supplement) suggest that only a fraction of fixations landed on emotional image regions, possibly due to bottom-up saliency (an interaction between saliency and emotion; AWS on emotional = 0.59, non-emotional = 0.69). The AWS, PQFT, and HouNIPS models outperform the others over these stimuli; these models also performed well on non-emotional stimuli. Fig. 6 shows saliency maps of some emotional images.

Predicting scanpath: Not only are humans correlated in terms of the locations they fixate, they also agree somewhat in the order of their fixations [31, 32]. In the context of saliency modeling, few models have aimed to predict the scanpath sequence, partly due to the difficulty in measuring and quantizing scanpaths.

Citations
Journal ArticleDOI
TL;DR: A comprehensive review of recent progress in salient object detection is provided and this field is situate among other closely related areas such as generic scene segmentation, object proposal generation, and saliency for fixation prediction.
Abstract: Detecting and segmenting salient objects from natural scenes, often referred to as salient object detection, has attracted great interest in computer vision. While many models have been proposed and several applications have emerged, a deep understanding of achievements and issues remains lacking. We aim to provide a comprehensive review of recent progress in salient object detection and situate this field among other closely related areas such as generic scene segmentation, object proposal generation, and saliency for fixation prediction. Covering 228 publications, we survey i) roots, key concepts, and tasks, ii) core techniques and main modeling trends, and iii) datasets and evaluation metrics for salient object detection. We also discuss open problems such as evaluation metrics and dataset bias in model performance, and suggest future research directions.

608 citations


Cites methods from "Analysis of Scores, Datasets, and M..."

  • ...tion and salient object segmentation. 2.2 Models in Closely Related areas 2.2.1 Fixation Prediction Models Reviewing all fixation prediction models goes beyond the scope of this paper (See [46], [143]–[145] for reviews and benchmarks of these models). Here we give pointers to the most important trends and works in this domain. Inclusion of these models here is to measure their performance versus salient...


  • ...t of fixation prediction models considered in this study. All of these models are based on pure low-level mechanisms and have shown to be very efficient in previous fixation prediction benchmarks [144], [145]. 2.2.2 Image Segmentation Models Segmentation is a fundamental problem studied in computer vision and usually adopted as a pre-process step to image analysis. Without any prior knowledge of the conte...


Journal ArticleDOI
TL;DR: The underlying relationship among OD, SOD, and COD is revealed and some open questions are discussed as well as several unsolved challenges and promising future works are pointed out.
Abstract: Object detection, including objectness detection (OD), salient object detection (SOD), and category-specific object detection (COD), is one of the most fundamental yet challenging problems in the computer vision community. Over the last several decades, great efforts have been made by researchers to tackle this problem, due to its broad range of applications for other computer vision tasks such as activity or event recognition, content-based image retrieval and scene understanding, etc. While numerous methods have been presented in recent years, a comprehensive review for the proposed high-quality object detection techniques, especially for those based on advanced deep-learning techniques, is still lacking. To this end, this article delves into the recent progress in this research field, including 1) definitions, motivations, and tasks of each subdirection; 2) modern techniques and essential research trends; 3) benchmark data sets and evaluation metrics; and 4) comparisons and analysis of the experimental results. More importantly, we will reveal the underlying relationship among OD, SOD, and COD and discuss in detail some open questions as well as point out several unsolved challenges and promising future works.

564 citations


Cites methods from "Analysis of Scores, Datasets, and M..."

  • ...As mentioned by the previous studies [7], [43], and [44], in the branch of bottom-up SOD, approaches are to detect saliency under free viewing, which is automatically determined by the physical characteristics of the scene, while approaches in the other branch are to detect the task-driven saliency determined by the current goals of the observer....


Journal ArticleDOI
TL;DR: This paper provides an analysis of 8 different evaluation metrics and their properties, and makes recommendations for metric selections under specific assumptions and for specific applications.
Abstract: How best to evaluate a saliency model's ability to predict where humans look in images is an open research question. The choice of evaluation metric depends on how saliency is defined and how the ground truth is represented. Metrics differ in how they rank saliency models, and this results from how false positives and false negatives are treated, whether viewing biases are accounted for, whether spatial deviations are factored in, and how the saliency maps are pre-processed. In this paper, we provide an analysis of 8 different evaluation metrics and their properties. With the help of systematic experiments and visualizations of metric computations, we add interpretability to saliency scores and more transparency to the evaluation of saliency models. Building off the differences in metric properties and behaviors, we make recommendations for metric selections under specific assumptions and for specific applications.

526 citations


Cites background from "Analysis of Scores, Datasets, and M..."

  • ...The shuffled AUC metric, sAUC [8], [20], [73], [74], [85] samples negatives from fixation locations from other images, instead of uniformly at random....


  • ...Dozens of computational saliency models are available to choose from [7], [8], [11], [12], [37], but objectively determining which model offers the “best” approximation to human eye fixations remains a challenge....


  • ...Differences in how saliency and ground truth are represented and which attributes of saliency models should be rewarded/penalized leads to different choices of metrics for reporting performance [8], [12], [42],...


  • ...[8] compared 32 saliency models with 3 metrics for fixation prediction and additional metrics for scanpath prediction on 4 datasets....


  • ...Most eye-tracking datasets have been shown to be center biased, containing a larger number of fixations near the image center, across different image types, videos, and even observer tasks [7], [8], [14], [16], [33], [36]....


Journal ArticleDOI
TL;DR: A convolutional long short-term memory (LSTM) network is proposed that iteratively refines the predicted saliency map by focusing on the most salient regions of the input image.
Abstract: Data-driven saliency has recently gained a lot of attention thanks to the use of convolutional neural networks for predicting gaze fixations. In this paper, we go beyond standard approaches to saliency prediction, in which gaze maps are computed with a feed-forward network, and present a novel model which can predict accurate saliency maps by incorporating neural attentive mechanisms. The core of our solution is a convolutional long short-term memory that focuses on the most salient regions of the input image to iteratively refine the predicted saliency map. In addition, to tackle the center bias typical of human eye fixations, our model can learn a set of prior maps generated with Gaussian functions. We show, through an extensive evaluation, that the proposed architecture outperforms the current state-of-the-art on public saliency prediction datasets. We further study the contribution of each key component to demonstrate their robustness on different scenarios.

503 citations

Journal ArticleDOI
TL;DR: DeepFix, a fully convolutional neural network (FCN), models the bottom-up mechanism of visual attention via saliency prediction and predicts the saliency map in an end-to-end manner.
Abstract: Understanding and predicting the human visual attention mechanism is an active area of research in the fields of neuroscience and computer vision. In this paper, we propose DeepFix, a fully convolutional neural network, which models the bottom–up mechanism of visual attention via saliency prediction. Unlike classical works, which characterize the saliency map using various hand-crafted features, our model automatically learns features in a hierarchical fashion and predicts the saliency map in an end-to-end manner. DeepFix is designed to capture semantics at multiple scales while taking global context into account, by using network layers with very large receptive fields. Generally, fully convolutional nets are spatially invariant—this prevents them from modeling location-dependent patterns (e.g., centre-bias). Our network handles this by incorporating a novel location-biased convolutional layer. We evaluate our model on multiple challenging saliency data sets and show that it achieves the state-of-the-art results.

443 citations

References
Journal ArticleDOI
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.

11,844 citations

Journal ArticleDOI
TL;DR: It is proved the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and, thus, its utility in detecting the modes of the density.
Abstract: A general non-parametric technique is proposed for the analysis of a complex multimodal feature space and to delineate arbitrarily shaped clusters in it. The basic computational module of the technique is an old pattern recognition procedure: the mean shift. For discrete data, we prove the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and, thus, its utility in detecting the modes of the density. The relation of the mean shift procedure to the Nadaraya-Watson estimator from kernel regression and the robust M-estimators; of location is also established. Algorithms for two low-level vision tasks discontinuity-preserving smoothing and image segmentation - are described as applications. In these algorithms, the only user-set parameter is the resolution of the analysis, and either gray-level or color images are accepted as input. Extensive experimental results illustrate their excellent performance.

11,727 citations


"Analysis of Scores, Datasets, and M..." refers background in this paper

  • ...The density function can be defined in terms of kernel K(x) with bandwidth h as follows [14]: $\hat{f}_{h,K}(x) = \frac{c_{k,d}}{n h^d} \sum_{i=1}^{n} K\!\left(\left\|\frac{x - x_i}{h}\right\|^2\right)$ ....


Journal ArticleDOI
TL;DR: A new hypothesis about the role of focused attention is proposed, which offers a new set of criteria for distinguishing separable from integral features and a new rationale for predicting which tasks will show attention limits and which will not.

11,452 citations


"Analysis of Scores, Datasets, and M..." refers background in this paper

  • ...What is the unit of attention? Do we attend to spatial locations, objects, or features? [5][27] A great deal of neurophysiological and behavioral evidence exists for all three....


Journal ArticleDOI
TL;DR: In this article, a visual attention system inspired by the behavior and the neuronal architecture of the early primate visual system is presented, where multiscale image features are combined into a single topographical saliency map.
Abstract: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented. Multiscale image features are combined into a single topographical saliency map. A dynamical neural network then selects attended locations in order of decreasing saliency. The system breaks down the complex problem of scene understanding by rapidly selecting, in a computationally efficient manner, conspicuous locations to be analyzed in detail.

10,525 citations

01 Jan 1998
TL;DR: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented, which breaks down the complex problem of scene understanding by rapidly selecting conspicuous locations to be analyzed in detail.

8,566 citations


Additional excerpts

  • ...In ITTI98, each feature map’s contribution to the saliency map is weighted by the squared difference between the globally most active location and the average activity of all other local maxima in the feature map [3]....


  • ...[28] - - [32] [3] [33] [20] [6] [9] [10] [24] [7] [21] [11] [13] [13] [12] [8] [25] [23] [15] [18] [22] [2] [16] [14] [19] [17] [26] [34] [47] [38] Year - - - - 98 00 03 05 05 05 06 06 07 07 07 08 08 08 09 09 09 09 09 09 10 10 10 10 10 11 11 12 Code M M C C C C C C M M M S M M M M E M S M M M M M S M E E M 11 11 11 Category O O I I C C B C B/I I C G C S I I I C B C S S I P S G I C B B C I -...


Frequently Asked Questions (11)
Q1. What contributions have the authors mentioned in the paper "Analysis of scores, datasets, and models in visual saliency prediction" ?

In this study, the authors pursue a critical and quantitative look at challenges (e.g., center-bias, map smoothing) in saliency modeling and the way they affect model accuracy. The authors find that, although model rankings vary, some (e.g., AWS, LG, AIM, and HouNIPS) consistently outperform other models over all datasets. The authors quantitatively compare 32 state-of-the-art models (using the shuffled AUC score to discount center-bias) on 4 benchmark eye movement datasets, for prediction of human fixation locations and scanpath sequence. Their benchmark provides a comprehensive high-level picture of the strengths and weaknesses of many popular models, and suggests future research directions in saliency modeling.

The authors found that some stimulus categories are harder for models (e.g., nature, nude, and portrait), which warrant more attention in future works. Future directions: In this regard, it will also be interesting to test the feasibility of predicting whether a scene is natural or man-made from saliency and fixations. The authors believe it is important to constantly measure the gap between the IO model and models to find out in which directions models lag behind human performance.

Visual stimuli used in neurophysiological and modeling works include static stimuli (synthetic search arrays involving pop-out and conjunction search, cartoons, or photographs) and spatio-temporal dynamic stimuli (movies and interactive video games).

Two important causes for CB are: (1) Viewing strategy where subjects start looking from the image center and (2) A perhaps stronger, photographer bias, which is the tendency of photographers to frame interesting objects at the center. 

A difficult challenge in fixation datasets which has affected fair model comparison is “Center-Bias (CB)”, whereby humans often appear to preferentially look near an image’s center [28]. 

But due to the lack of an exhaustive, coherent benchmarking system that addresses issues such as evaluation measures (e.g., at least 4 types of AUC measures have been used; supplement), center-bias, map characteristics (e.g., smoothing), and dataset bias, many inconsistencies still exist.

In the context of saliency modeling, few models have aimed to predict the scanpath sequence, partly due to the difficulty in measuring and quantizing scanpaths.

Fixation histogram is made by dividing the image into a grid pattern (16 × 16) and counting the number of fixations in each grid. 

To compute the histograms for a given image, the authors initially compute corresponding features (e.g., saccade velocity, etc.) for each observer and quantize the values into several bins. 


Properly tuning these parameters is important in fair model comparison and is perhaps best left for a model developer to optimize himself.