
Analysis of scores, datasets, and models in visual saliency prediction
Ali Borji, Hamed R. Tavakoli+, Dicky N. Sihite, Laurent Itti
Department of Computer Science, University of Southern California, Los Angeles
+Center for Machine Vision Research, University of Oulu, Finland
Abstract
Significant recent progress has been made in developing
high-quality saliency models. However, less effort has been
undertaken on fair assessment of these models, over large
standardized datasets and correctly addressing confound-
ing factors. In this study, we pursue a critical and quanti-
tative look at challenges (e.g., center-bias, map smoothing)
in saliency modeling and the way they affect model accu-
racy. We quantitatively compare 32 state-of-the-art mod-
els (using the shuffled AUC score to discount center-bias)
on 4 benchmark eye movement datasets, for prediction of
human fixation locations and scanpath sequence. We also
account for the role of map smoothing. We find that, al-
though model rankings vary, some (e.g., AWS, LG, AIM, and
HouNIPS) consistently outperform other models over all
datasets. Some models work well for prediction of both fix-
ation locations and scanpath sequence (e.g., Judd, GBVS).
Our results show low prediction accuracy for models over
emotional stimuli from the NUSEF dataset. Our last bench-
mark, for the first time, gauges the ability of models to de-
code the stimulus category from statistics of fixations, sac-
cades, and model saliency values at fixated locations. In
this test, ITTI and AIM models win over other models. Our
benchmark provides a comprehensive high-level picture of
the strengths and weaknesses of many popular models, and
suggests future research directions in saliency modeling.
1. Introduction
A large number of models have been proposed for predicting where people look in scenes [1]. However, in the absence of an exhaustive, coherent benchmarking system that addresses issues such as evaluation measures (e.g., at least 4 types of AUC measures have been used; see supplement), center-bias, map characteristics (e.g., smoothing), and dataset bias, many inconsistencies remain. The discrepancy of results in previous works calls for a unified approach to gauging progress in this field and to fair comparison of models.
Importance. Modeling visual saliency broadens our understanding of a highly complex cognitive behavior, which may lead to subsequent findings in other areas (object and scene recognition, visual search, etc.) [27][5]. It also benefits many engineering applications (e.g., object detection and segmentation, content-aware image re-targeting, image in-painting, visual tracking, image and video compression, crowd analysis and social gaming [2][6][24][37][30], determining the importance of objects in a scene [48, 44], memorability of image regions [49], and object recall [50]).
Our contributions. We offer 3 main contributions: (1) dis-
cussing current challenges and directions in saliency mod-
eling (evaluation metrics, dataset bias, model parameters,
etc.) and proposing solutions, (2) comparing 32 models and
their pros and cons in a unified quantitative framework over
4 widely-used datasets for fixation prediction (on classic
and emotional stimuli) as well as scanpath prediction, and
(3) stimuli/task decoding using saliency and fixation statis-
tics. Hopefully, our study will open new directions and con-
versations and help better organize the saliency literature.
Previous benchmarks. Few attempts have been made at saliency model comparison, and their shortcomings have driven us to conduct a new benchmark that considers the latest progress. Borji et al. [30] compared 36 models over 3 datasets. Judd et al. [41] compared 9 models over only 300 images. Some works have compared salient object detection and region-of-interest algorithms [42]. Other benchmarks have compared models over applications such as image quality assessment [52]. While very effective, previous comparisons have not correctly addressed all of the parameters that challenge model accuracy. For example, map smoothing, which influences the fixation prediction accuracy of models [43], and center-bias (the tendency of humans to look towards the center of an image, such that a trivial model that predicts salience near the center may surpass other saliency models) [28, 25] have not been addressed in previous benchmarks. Here, we thoroughly investigate these shortcomings, with an additional comparison of models over scanpath sequences. We provide the latest update on saliency modeling, with the most comprehensive set of models, challenges/parameters, datasets, and measures.
2. Basic concepts and definitions
Here, we lay the groundwork for the rest of the paper and explain some basic concepts of visual attention.
There is often confusion between saliency and attention. Saliency is a property of the perceived visual stimulus (bottom-up (BU)) or, at most, of the features that the visual system extracts from the stimulus (which can be manipulated by top-down (TD) cues). Attention is a much more general concept that depends on many very high-level cognitive factors, such as the strategy for image search and the interactions between saliency and search strategy, as well as subjective factors such as age and experience.
A major distinction between mechanisms of visual atten-
tion is the bottom-up vs. top-down dissociation. Bottom-up
attention is reflexive, fast, likely feed-forward, and mainly
deployed by stimulus saliency (e.g., pop-out). On the other
hand, top-down attention is deliberative, slow, and powerful
with variable selection criteria depending on the task.
Previous studies of attention differ mainly in the type of stimuli they have employed and the tasks they have addressed. Visual stimuli used in neurophysiological and modeling works include static stimuli (synthetic search arrays involving pop-out and conjunction search, cartoons, or photographs) and spatio-temporal dynamic stimuli (movies and interactive video games). These stimuli have been exploited for studying visual attention over three types of tasks: (1) free viewing, (2) visual search, and (3) interactive tasks (games or real-world tasks [54]).
What is the unit of attention? Do we attend to spatial
locations, objects, or features? [5][27] A great deal of neu-
rophysiological and behavioral evidence exists for all three.
Space-based theories claim that humans deliberately attend
to spatial locations where a target may appear. Similar ob-
servations indicate that visual attention is essentially guided
by recognized objects, with low-level saliency contributing
only indirectly [50]. Features are well explored by neural
studies showing that neurons accommodate their response
properties to render an object of interest more salient. Note
that these are not exclusive concepts.
A closely related field to saliency modeling is salient re-
gion detection. While the goal of the former is to predict
locations that grab attention, the latter attempts to segment
the most salient object or region in a scene. Evaluation is
often done by measuring precision-recall of saliency maps
of a model against ground truth data (explicit saliency judg-
ments of subjects by annotating salient objects or clicking
on locations). Some models from the two categories have been compared against each other without regard to this distinction and the different goals of the two model types.
The majority of models described by the above concepts
have focused on bottom-up, space-based, and static atten-
tion for explaining eye movements in free-viewing tasks.
3. Analysis of challenges and open problems
We discuss challenges that have emerged as more models
have been proposed. These are open issues not only for
research but also for performing fair model comparison.
[Figure 1: two scatter plots placing 13 eye movement datasets by their parameters. Left, number of images vs. number of eye-tracking subjects: Le Meur (27, 40), DOVES (101, 29), Kootstra (99, 31), Toronto (120, 11), FIFA (180, 8), NUSEF (758, 25), MIT (1003, 15), Kienzle (200, 14), Einhauser (93, 8), Engman (90, 8), Tatler (120, 22), Engelke (90, 8), Reinagel (77, 5). Right, distance from screen vs. presentation time: Kienzle (60, 3), Einhauser (80, 3), Toronto (75, 4), MIT (48, 3), FIFA (80, 2), NUSEF (43, 5), Reinagel (79, 4), Kootstra (70, 3), Tatler (60, 5), Engman (85, 2), DOVES (134, 5), Engelke (60, 8).]
Figure 1. Parameters of 13 major eye movement datasets. The MIT [2] dataset is the largest, with 1003 images; it has a high degree of photographer bias and few eye-tracking subjects. Le Meur [21] has only 27 images but the highest number of eye-tracking subjects (40). The Toronto dataset [10] has 120 images, mainly indoor and in-city scenes. The 5 common object categories in the Judd dataset are humans, faces, text, animals, and cars.
Dataset bias. Available eye movement datasets vary on several parameters, for instance: number of images, number of viewers, viewing time per image, the subject's distance from the screen, and stimulus variety [55]. Fig. 1 shows some popular eye tracking datasets, some of which are publicly available. Due to the small size of the Toronto dataset and its small number of subjects, using it alone is discouraged. Perhaps the best options so far are the NUSEF [4] and Judd [2] datasets. The NUSEF dataset (758 images and 25 subjects) contains a large number of affective stimuli, making it more suitable for studying semantic attentional cues. As Fig. 1 shows, larger fixation datasets with many images and eye-tracking subjects are needed. Because datasets are specialized (optimal feature weights differ across datasets [36]), a fair evaluation should compare models over several datasets, as presented in Sec. 4. Further, it has been shown that fixation density maps from different laboratories differ significantly due to inter-laboratory differences and experimental conditions [56].
A difficult challenge in fixation datasets, which has affected fair model comparison, is "center-bias (CB)", whereby humans often appear to preferentially look near an image's center [28]. Two important causes of CB are: (1) viewing strategy, where subjects start looking from the image center, and (2) a perhaps stronger photographer bias, the tendency of photographers to frame interesting objects at the center. Annoyingly, due to CB in the data, a trivial saliency model that just consists of a Gaussian blob at the center of the image often scores higher than almost all saliency models [2]. This can be verified from the average eye fixation maps of 3 popular datasets (see supplement). We observed higher central fixation densities for images with objects at the center compared with those with objects off center. Another problem that is in essence similar to CB is handling invalid filter responses at image borders (the "border effect"; e.g., the AIM model [10]; see [25]).
Some models have added a center-bias (location prior) explicitly (e.g., Judd) or implicitly (e.g., GBVS), making fair comparison challenging.
[Figure 2 panel annotations: AUC, CC, and NSS curves as functions of σ1, σ2, and border size; example maps scored cc 1.0, auc 1.0, nss 5.3 (fixations) vs. cc −0.03, auc 0.4, nss −0.2 and cc −0.01, auc 0.4, nss −0.07 (mean eye position, MEP); map size 300 × 300.]
Figure 2. Score analysis. 1st column: scores of a saliency map made by placing a variable Gaussian (σ1) at fixated locations; 2nd column: scores of the central Gaussian blob (σ2); 3rd column: scores of the image with variable border size. Results are averaged over 1000 runs with 10 randomly generated fixations drawn from a Gaussian distribution to mimic center-bias [28] in the data, similar to the heatmap in Fig. 3. Image size: 300 × 300.
Three possible remedies are: (1) every model adds a central Gaussian, which introduces the Gaussian size and its weight as two additional parameters; (2) collecting datasets with no CB, which is difficult since, even if we had an approach to distribute image content uniformly, the viewing strategy would still exist; and (3) designing suitable metrics, which we consider the most reasonable approach.
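To make remedy (1) concrete, here is a minimal sketch in Python/NumPy (our own illustration, not code from any of the benchmarked models; the function names and default values are hypothetical) of blending a model's saliency map with a central Gaussian prior, which exposes the prior's width and weight as the two extra parameters mentioned above.

```python
import numpy as np

def central_gaussian(shape, sigma_frac=0.25):
    # Isotropic Gaussian blob centered on the image; sigma given as a fraction of image width.
    h, w = shape
    y, x = np.mgrid[0:h, 0:w]
    sigma = sigma_frac * w
    g = np.exp(-((x - w / 2.0) ** 2 + (y - h / 2.0) ** 2) / (2.0 * sigma ** 2))
    return g / g.max()

def add_center_prior(saliency, weight=0.5, sigma_frac=0.25):
    # Blend a (normalized) saliency map with a central Gaussian prior.
    # 'weight' and 'sigma_frac' are the two additional free parameters this remedy introduces.
    s = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-12)
    return (1.0 - weight) * s + weight * central_gaussian(s.shape, sigma_frac)

# With weight = 1.0 the output reduces to the trivial Gaussian-blob baseline of Sec. 4.
```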
Evaluation metrics. Traditionally, saliency models have been evaluated against eye movement datasets. In some cases, accuracy is measured by whether one can predict what changes people will notice, or what they will remember or annotate [1]. We use three popular metrics for saliency evaluation: (1) the Correlation Coefficient (CC) between a model saliency map s and a human saliency map h, $CC(s,h) = \mathrm{cov}(s,h) / (\sigma_s \sigma_h)$; (2) the Normalized Scanpath Saliency (NSS), the average of normalized saliency values at the n fixated locations, $NSS = \frac{1}{n}\sum_{i=1}^{n} \frac{s(x_h^i, y_h^i) - \mu_s}{\sigma_s}$ [51]; and (3) the Area Under the ROC Curve (AUC), where human fixations are considered as the positive set and some points uniformly sampled from the image as the negative set. The saliency map is then treated as a binary classifier separating positives from negatives: by thresholding the map and plotting the true positive rate vs. the false positive rate, an ROC curve is obtained and the area under it is computed. Please see the supplement for a discussion of the subtle variations of the AUC metric. KL divergence [9] and Earth Mover's Distance (EMD) [36] have also been used for model evaluation. Some studies have evaluated the sequence of fixations in a scanpath [32, 31].
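For concreteness, the following is a minimal sketch of the three metrics as defined above, written in Python/NumPy for this article (the function names, the number of negative samples, and the thresholding scheme are our own assumptions, not the benchmark's released code):

```python
import numpy as np

def cc(s, h):
    # Correlation coefficient between model map s and human fixation map h.
    return np.corrcoef(s.ravel(), h.ravel())[0, 1]  # cov(s, h) / (sigma_s * sigma_h)

def nss(s, fixations):
    # Normalized Scanpath Saliency: mean z-scored saliency at the fixated (x, y) pixels.
    z = (s - s.mean()) / (s.std() + 1e-12)
    return float(np.mean([z[y, x] for x, y in fixations]))

def auc(s, fixations, n_neg=1000, n_thresh=100, seed=0):
    # AUC: fixated pixels are positives; uniformly sampled pixels are negatives.
    rng = np.random.default_rng(seed)
    pos = np.array([s[y, x] for x, y in fixations])
    neg = s[rng.integers(0, s.shape[0], n_neg), rng.integers(0, s.shape[1], n_neg)]
    thresholds = np.linspace(s.min(), s.max(), n_thresh)[::-1]   # high -> low
    tpr = np.array([(pos >= t).mean() for t in thresholds])
    fpr = np.array([(neg >= t).mean() for t in thresholds])
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))  # trapezoidal ROC area
```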
Fig. 2 shows an analysis of how the above scores are affected by the smoothness of the saliency map and possible center-bias in the reference data. We generated random eye fixations (sampled from a Gaussian distribution) and made a saliency map by convolving them with a Gaussian filter of variable sigma σ1. As shown in the 1st column, increasing σ1 reduces all 3 scores. Over AUC, however, the drop is moderate and the range is very small, meaning that as long as the hit rates are high, the AUC is high regardless of the false alarm rate [36]¹. As shown in the 2nd column, we placed a Gaussian at the center of the image and calculated the scores again while varying the σ2 of the central Gaussian as well as the σ1 of the Gaussian convolved with fixations (the latter only for CC, since NSS and AUC use fixation positions directly). Increasing σ2 raises all 3 scores up to the maximum match between the Gaussian and the MEP map, after which they drop or saturate. CC scores also rise with increasing σ1. The third column in Fig. 2 shows that, as the border size increases, scores reach a maximum and then drop, an effect similar to center-bias. These analyses show that smoothing of the saliency maps and the size of the central Gaussian affect scores and should be accounted for in fair model comparison. The NSS score is more sensitive to smoothing. All of these scores suffer from center-bias.
Two other issues regarding scores are sensitivity to map normalization (a.k.a. re-parameterization) and having well-defined bounds (and a chance level). Some scores are invariant to continuous monotonic nonlinearities (e.g., KL) while others are not (CC, NSS, and AUC). All scores are invariant to saliency map shifting and scaling. Some scores have well-defined bounds (CC and AUC have lower and upper bounds) while some do not (KL and NSS; KL has a lower bound, and NSS has an upper bound and a chance level of 0).
A proper score for tackling CB is the shuffled AUC (sAUC) [25], whose only difference from AUC is that, instead of selecting negative points randomly, all fixations over other images are used as the negative set. This score is not affected by σ2 or the border size in Fig. 2. The sAUC value for both a central Gaussian and a uniform white map is near 0.5 (i.e., with fixations from other images as the negative set [28]). When using the method in [25] (i.e., saliency from other images but at fixations of the current image), this type of AUC leads to the exact value of 0.5 for the central Gaussian (see Supp.).
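The shuffled AUC then differs only in how the negative set is built; below is a sketch under the same assumptions as above (illustrative Python, not the reference implementation of [25] or [28]):

```python
import numpy as np

def shuffled_auc(s, fixations, other_fixations, n_thresh=100):
    # sAUC: positives are fixations on this image; negatives are fixations recorded on
    # *other* images (in this image's coordinate frame). Because both sets share the same
    # center-biased spatial distribution, a pure central Gaussian scores near 0.5.
    pos = np.array([s[y, x] for x, y in fixations])
    neg = np.array([s[y, x] for x, y in other_fixations])
    thresholds = np.linspace(s.min(), s.max(), n_thresh)[::-1]
    tpr = np.array([(pos >= t).mean() for t in thresholds])
    fpr = np.array([(neg >= t).mean() for t in thresholds])
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```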
Features for saliency detection. Traditionally, intensity,
orientation, and color (in LAB and RGB spaces) have been
used for saliency derivation over static images. For dynamic
scenes, flicker and motion features have been added. Fur-
thermore, several other low-level features have been used
to estimate saliency (size, depth, optical flow, etc.). High-
level features (prior knowledge) such as faces [2], peo-
ple [2], cars [2], symmetry [8], signs, and text [35] have
also been incorporated. One challenge is detecting affective
(emotional) features and semantic (high-level knowledge)
scene properties (e.g., causality, action-influence) which
have been suggested to be important in guiding attention
(location and fixation duration) [4]. Models usually use all channels for all sorts of stimuli, which makes them highly dependent on the false positive rates of the employed feature detectors (e.g., face or car detectors). Since existing models use linear features, they render highly textured regions more salient. Non-linear features (e.g., the famous egg-in-the-nest or birthday-candle images [25]) have been proposed but have not been fully implemented.
¹Note that in [36] and [2], AUC is calculated by thresholding the saliency map and then measuring the hit rate, which is different from what we (and also [25, 28, 43]) do by spreading random points over the image.
[Figure 3: the same saliency map smoothed with Gaussian kernels of size 0.05 down to 0.01, with AUC rising from 0.7174 to 0.7286 as the kernel shrinks; panels show the input image and the human fixation map.]
Figure 3. A sample saliency map smoothed by convolving with a variable-size Gaussian kernel (for the AWS model over an image of the Toronto dataset).
Parameters. Models often have several design parame-
ters such as the number and type of filters, choice of non-
linearities, within-feature and across-scale normalization
schemes, smoothing, and center-bias. Properly tuning these parameters is important for fair model comparison and is perhaps best left to the model developers themselves to optimize.
4. Saliency benchmark
We chose four widely-used datasets for model compari-
son: Toronto [10], NUSEF [4], MIT [2], and Kootstra [8].
Table 1 shows 30 models compared here. Additionally,
we implemented two simple yet powerful models, to serve
as baselines: Gaussian Blob (Gauss) and Human inter-
observer (IO). Gaussian blob is simply a 2D Gaussian
shape drawn at the center of the image; it is expected to
predict human gaze well if such gaze is strongly clustered
around the center. For a given stimulus, the human model
outputs a map built by integrating fixations from other sub-
jects than the one under test while they watched that same
stimulus. The map is usually smoothed by convolving with
a Gaussian filter. This inter-observer model is expected to
provide an upper bound on prediction accuracy of compu-
tational models, to the extent that different humans may be
the best predictors of each other. We resized saliency maps
to the size of the original images onto which eye movements
have been recorded. Please note that, besides models com-
pared here, some other models may exist that might perform
well (e.g., [37]), but are not publicly available or easily ac-
cessible. We leave them for future investigations.
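The two baselines can be sketched as follows (a minimal Python illustration under our own assumptions; the blob width and the IO smoothing sigma are free parameters the text does not pin down, and this is not the authors' code):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_blob_baseline(shape, sigma_frac=0.25):
    # Trivial center model: a single 2D Gaussian drawn at the image center.
    h, w = shape
    y, x = np.mgrid[0:h, 0:w]
    sigma = sigma_frac * w
    return np.exp(-((x - w / 2.0) ** 2 + (y - h / 2.0) ** 2) / (2.0 * sigma ** 2))

def inter_observer_map(shape, fixations_per_subject, test_subject, sigma=20):
    # Human IO model: accumulate fixations of all subjects except the one under test,
    # then smooth the fixation map with a Gaussian kernel.
    m = np.zeros(shape)
    for subject, fixations in fixations_per_subject.items():
        if subject == test_subject:
            continue
        for x, y in fixations:
            m[y, x] += 1.0
    return gaussian_filter(m, sigma)
```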
We first measure how well a model performs at predicting where people look over static-image eye movement datasets. We report results using the sAUC score, as it has several advantages over the others. Results over other scores are shown in the supplement. Note, however, that the sAUC score alone is not the only criterion for our conclusions, as it gives more credit to off-center information and favors true positives more. Next, we compare models on their ability to predict the saccade sequence. Our conclusions are based on the premise that a good model should perform well over all configurations (i.e., scores, datasets, and parameters).
Predicting fixation locations: Model scores and average ranks using sAUC over the four datasets are shown in Table 1. We smoothed the saliency map of each model by convolving it with a Gaussian kernel (Fig. 3). We then plotted the sAUC of each model over a range of kernel standard deviations (from 0.01 to 0.13 of the image width, in steps of 0.01) and took the maximum value over this range for each model. Compared with our rankings on the original maps (supplement), some models now obtain better scores. Although the ranking order is not the same over the four datasets, some general patterns are noticeable. The Gaussian model is the worst (not significantly better than chance) over all datasets, as expected. There is a significant difference between the models and the IO model. This difference is more pronounced over the NUSEF and MIT datasets, as they contain many stimuli with complex high-level concepts.
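The smoothing protocol can be sketched as follows (illustrative Python; `shuffled_auc` is the sketch given earlier, and the sweep range mirrors the 0.01 to 0.13 of image width stated above):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def best_smoothed_sauc(saliency, fixations, other_fixations,
                       sigma_fracs=np.arange(0.01, 0.14, 0.01)):
    # Sweep the smoothing sigma (as a fraction of image width) and report the
    # maximum shuffled AUC over the sweep, as done for the numbers in Table 1.
    width = saliency.shape[1]
    scores = [shuffled_auc(gaussian_filter(saliency, sigma=f * width),
                           fixations, other_fixations) for f in sigma_fracs]
    best = int(np.argmax(scores))
    return scores[best], float(sigma_fracs[best])
```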
The AWS model is significantly better than all other models, followed by LG, AIM, Global Rarity, Torralba, HouCVPR, HouNIPS, SDSR, and Judd. Over the largest dataset (i.e., MIT), AWS, LG, AIM, and Torralba performed better than other models. Over the NUSEF dataset, the AIM, LG, Torralba, HouCVPR, and HouNIPS models did best. Kootstra, STB, ITTI (due to its different normalization and sparser maps compared with ITTI98), and Marat ranked at the bottom. Interestingly, the AWS model performs as well as the human IO on the Kootstra dataset. Our analyses show that CC, NSS, and AUC produce very high scores for the Gaussian, which beats almost all models thanks to its center preference (see supplement). Therefore, we do not recommend using them for saliency model comparison. Considering rankings over the sAUC, CC, and NSS scores, we noticed that models that performed well by sAUC also rank near the top by the other scores.
Fig. 4 shows model performance over stimulus categories of the NUSEF dataset, for each model and averaged over all models. There is no significant difference across stimulus categories when averaged over all models (inset; see also supplement), although models seem to perform best over face stimuli and worst over portrait and nude stimuli (this pattern is clearer when considering only top-performing models).
[Figure 4: bar plot of shuffled AUC (roughly 0.4 to 0.65) for 19 models (AIM, AWS, Bian, Entropy, GBVS, G-Rarity, Tavakoli, HouCVPR, HouNIPS, ITTI, Judd, L-Rarity, PQFT, SDSR, SUN, Surprise, Torralba, Variance, STB) plus the Gauss and Human IO baselines, broken down by NUSEF category (event, face, nude, other, portrait); inset: category averages over the 412 NUSEF images.]
Figure 4. Model performance over categories of the NUSEF dataset. Gauss and Human IO are excluded from the average (i.e., the inset). Number of images: event: 36, face: 52, nude: 20, other: 181, portrait: 123.

Model                        Ref.   Year  Code  Cat.  Toronto  NUSEF  MIT  Kootstra  Avg Rank
Gaussian-Blob                [28]   -     M     O     .50      .49    .50  .50       -
Inter-observer (IO)          -      -     M     O     .73      .66    .75  .62       -
Variance                     -      -     C     I     .66      .62    .65  .58       4.8
Entropy                      [32]   -     C     I     .65      .61    .64  .57       5.8
Itti et al. (ITTI98)         [3]    98    C     C     .63      .57    .62  .58       7.3
Itti et al. (ITTI)           [33]   00    C     C     .62      .56    .61  .57       8.3
Torralba                     [20]   03    C     B     .69      .63    .67  .59       3
Vocus (Frintrop)             [6]    05    C     C     .66      -      .65  .60       4.7
Surprise (Itti & Baldi)      [9]    05    M     B/I   .63      .59    .63  .58       6.8
AIM (Bruce & Tsotsos)        [10]   05    M     I     .69      .64    .68  .59       2.5
Saliency Toolbox (STB)       [24]   06    M     C     .62      .56    .58  .57       8.8
GBVS (Harel et al.)          [7]    06    S     G     .65      .59    .64  .56       6.5
Le Meur et al.               [21]   07    M     C     .66      -      .57  .57       8
HouCVPR (Hou & Zhang)        [11]   07    M     S     .69      .63    .65  .59       3.5
Local Rarity (Mancas)        [13]   07    M     I     .65      .60    .63  .58       6
Global Rarity (Mancas)       [13]   08    M     I     .69      .62    .67  .61       2.8
HouNIPS (Hou & Zhang)        [12]   08    E     I     .69      .63    .65  .59       3.5
Kootstra et al.              [8]    08    M     C     .61      -      .60  .56       9.3
SUN (Zhang et al.)           [25]   09    S     B     .67      .61    .65  .56       5.3
Marat et al.                 [23]   09    M     C     .64      -      .62  .54       8
PQFT (Guo et al.)            [15]   09    M     S     .68      .61    .66  .58       4.3
Yin Li et al.                [18]   09    M     S     .69      -      .65  .59       4
SDSR (Seo & Milanfar)        [22]   09    M     I     .69      .61    .65  .60       3.8
Judd et al.                  [2]    09    M     P     .68      .61    .66  .59       4
Bian et al.                  [16]   10    S     S     .61      .63    .61  .57       7
ESaliency (Avraham et al.)   [14]   10    M     G     .65      -      .62  .56       5.8
Yan et al.                   [19]   10    E     I     .68      -      .64  .58       5
AWS (Diaz et al.)            [17]   10    E     C     .72      .64    .69  .62       1
Jia Li et al.                [26]   10    M     B     .67      -      -    .56       6.7
Tavakoli et al.              [34]   11    11    B     .64      .56    .65  -         6.5
Murray et al.                [47]   11    11    C     .64      .57    .65  -         6.7
LG (Borji & Itti)            [38]   12    11    I     .70      .63    .68  .59       2.5
Avg. score over models       -      -     -     -     .66      .60    .64  .58       -
Table 1. Compared visual saliency models. Code abbreviations: M: Matlab, C: C/C++, E: Executables, S: Sent saliency maps. Note that STB [24] and VOCUS [6] are two implementations of the Itti et al. [3] model. Numbers are the maximum shuffled AUC scores of models, obtained by optimizing the saliency map smoothness (Fig. 3). See the supplement for the optimal sigma values of the Gaussian kernel where models reach their maxima (σ from 0.01 : 0.01 : 0.13 of the image width). Each model category is one of [30]: Cognitive (C), Bayesian (B), Decision-theoretic (D), Information-theoretic (I), Graphical (G), Spectral-analysis (S), Pattern-classification (P), Others (O). We observe lower performance for the STB model compared with either the ITTI or ITTI98 models over free-viewing datasets; thus, its use in place of the ITTI model (e.g., for model-based behavioral studies) is not encouraged. We employ two different versions of the Itti et al. model, ITTI98 and ITTI, which correspond to different normalization schemes. In ITTI98, each feature map's contribution to the saliency map is weighted by the squared difference between the globally most active location and the average activity of all other local maxima in the feature map [3]. This gives rise to smooth saliency maps, which tend to correlate better with noisy human eye movement data. In the ITTI model [33], the spatial competition for saliency is much stronger and is implemented in each feature map as 10 rounds of convolution by a large difference-of-Gaussians followed by half-wave rectification. This gives rise to much sparser saliency maps, which are more useful than the ITTI98 maps when trying to decide on the single next location to look at (e.g., in machine vision and robotics applications). The Kootstra dataset is the hardest one for humans (low IO agreement) and models; the next hardest is the NUSEF dataset. Note that the symmetry model of Kootstra cannot compete with the other models over the Kootstra dataset, although there are many images with symmetric objects in this dataset. Numbers are rounded to their closest value (see supplement for more accurate values). In Borji et al. [30], results are on original images, while here we report optimized results over smoothed maps. In our experiments we used myGauss = fspecial('gaussian', 50, 10), which was then normalized to [0 1]. In principle, the smoothing Gaussian should be about 1/2 of the visual field.
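For reference, the MATLAB call in the caption (fspecial('gaussian', 50, 10), rescaled to [0 1]) corresponds roughly to this Python sketch (our own approximate equivalent, not part of the paper's code):

```python
import numpy as np

def fspecial_gaussian_01(size=50, sigma=10):
    # Approximate MATLAB fspecial('gaussian', size, sigma): a size x size Gaussian
    # kernel, here rescaled to [0, 1] as described in the Table 1 caption.
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.max()
```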
[Figure 5: shuffled AUC as a function of the smoothing parameter (Gaussian size) for AIM, SUN, AWS, GBVS, HouCVPR, HouNIPS, ITTI, Judd, and PQFT. Right-hand table of maximum sAUC scores: AIM 0.585, SUN 0.587, AWS 0.598, GBVS 0.569, HouCVPR 0.588, HouNIPS 0.584, ITTI 0.531, Judd 0.543, PQFT 0.591, Gauss 0.50.]
Figure 5. sAUC scores over emotional images of the NUSEF dataset. Results in the right-hand table are the maxima over the smoothing range.
The AWS model did the best over all categories. HouNIPS, Judd, SDSR, Yan, and AIM also ranked near the top. Faces are often located at the image center, while nude and event stimuli are mostly off-center. Humans are more correlated for portrait, event, and nude stimuli. A separate analysis over the Kootstra dataset showed that models have difficulty with saliency detection over nature stimuli, where there are fewer distinctive and salient objects (see supplement). This means that much progress remains to be made in saliency detection over stimuli containing conceptual content (e.g., images containing interacting objects, actions such as grasping, living vs. non-living distinctions, or object regions inside a bigger object, i.e., faces, body parts, etc.).
Figure 6. Sample emotional images with positive, negative, and neutral emotional valence from the NUSEF dataset (Mickey, positive, valence 7.4; wolf, neutral, valence 4.21; elderly woman, negative, valence 3.26; watermelon, positive, valence 7.04; fire hydrant, neutral, valence 5.24; harassment, negative, valence 3.19; image numbers 4621, 2590, 1302, 7100, 1999, 7325), along with saliency maps of the AWS model. Note that in some cases saliency misses fixations.
Behavioral studies have shown that affective stimuli in-
fluence the way we look at images. Humphrey et al. [45]
showed that initial fixations were more likely to be on emo-
tional objects than more visually salient neutral ones. Here
we take a close look at model differences over emotional
(affective) stimuli, using 287 images from NUSEF that belong to the IAPS dataset [46]. Fig. 5 shows sAUC
scores of 10 models over affective stimuli. These val-
ues are smaller than the ones for NUSEF (non-affective)
shown in Table 1. Our results (using shuffled AUC and
with smoothing similar to Table 1; see supplement) sug-
gest that only a fraction of fixations landed on emotional
image regions, possibly due to bottom-up saliency (inter-
action between saliency and emotion; AWS on emotional
= 0.59, non-emotional = 0.69). Models AWS, PQFT, and
HouNIPS outperform others over these stimuli. These mod-
els also performed well on non-emotional stimuli. Fig. 6
shows saliency maps of some emotional images.
Predicting scanpath: Not only are humans correlated in terms of the locations they fixate, but they also agree somewhat in the order of their fixations [31, 32]. In the context of saliency modeling, few models have aimed to predict the scanpath sequence, partly due to the difficulty in measuring and quantizing scanpaths.