Analysis of Scores, Datasets, and Models in Visual Saliency Prediction
Citations
Salient Object Detection: A Survey
Advanced Deep-Learning Techniques for Salient and Category-Specific Object Detection: A Survey
What Do Different Evaluation Metrics Tell Us About Saliency Models?
Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model
DeepFix: A Fully Convolutional Neural Network for Predicting Human Eye Fixations
References
Visual saliency based on conditional entropy
Biological plausibility of spectral domain approach for spatiotemporal visual saliency
Overview of Eye tracking Datasets
Decorrelation and Distinctiveness Provide with Human-Like Saliency
Objects do not predict fixations better than early saliency: a re-analysis of Einhauser et al.'s data.
Frequently Asked Questions (11)
Q2. What future works have the authors mentioned in the paper "Analysis of scores, datasets, and models in visual saliency prediction" ?
The authors found that some stimulus categories (e.g., nature, nude, and portrait) are harder for models, and these warrant more attention in future works. As future directions, the authors note it will also be interesting to test the feasibility of predicting whether a scene is natural or man-made from saliency and fixations. They further believe it is important to constantly measure the gap between the IO model and saliency models to find out in which directions models lag behind human performance.
Q3. What are the common types of stimuli used in neurophysiological and modeling works?
Visual stimuli used in neurophysiological and modeling works include static stimuli (synthetic pop-out and conjunction search arrays, cartoons, or photographs) and dynamic spatio-temporal stimuli (movies and interactive video games).
Q4. What are the two main causes of CB?
Two important causes of CB are: (1) viewing strategy, where subjects start looking from the image center, and (2) a perhaps stronger photographer bias, the tendency of photographers to frame interesting objects at the center.
Q5. What is the difficult challenge in the fixation datasets?
A difficult challenge in fixation datasets which has affected fair model comparison is “Center-Bias (CB)”, whereby humans often appear to preferentially look near an image’s center [28].
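A common way to make the center-bias baseline concrete is an isotropic Gaussian centered on the image, which many benchmarks compare models against. Below is a minimal sketch of such a center-prior map; the function name `center_bias_map` and the `sigma_frac` width parameter are illustrative assumptions, not from the paper.

```python
import numpy as np

def center_bias_map(height, width, sigma_frac=0.25):
    """Isotropic Gaussian center-prior map, a common center-bias baseline.

    sigma_frac (an assumed default) scales the Gaussian width relative to
    the larger image dimension. Returns values normalized to [0, 1].
    """
    ys = np.arange(height) - (height - 1) / 2.0  # row offsets from center
    xs = np.arange(width) - (width - 1) / 2.0    # column offsets from center
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    sigma = sigma_frac * max(height, width)
    g = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return g / g.max()
```

Because fixations cluster near the center, such a map alone often scores competitively, which is why benchmarks must control for CB when comparing models.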
Q6. Why are there still inconsistencies in the results of previous benchmarks?
Due to the lack of an exhaustive, coherent benchmarking system that addresses issues such as evaluation measures (e.g., at least four types of AUC measure have been used; see the supplement), center-bias, map characteristics (e.g., smoothing), and dataset bias, many inconsistencies still exist.
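The AUC variants mentioned above differ mainly in how non-fixated (negative) locations are sampled. The sketch below shows one simple convention, treating saliency values at fixated pixels as positives and all pixels as negatives, similar in spirit to the AUC-Judd variant; the function name and this particular negative-sampling choice are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def fixation_auc(saliency, fixations):
    """ROC AUC of a saliency map against human fixations (one of several
    AUC conventions): positives are saliency values at fixated pixels,
    negatives are values at all pixels.

    saliency: 2D array; fixations: iterable of (x, y) pixel coordinates.
    """
    pos = np.array([saliency[int(y), int(x)] for x, y in fixations], float)
    neg = saliency.ravel().astype(float)
    # Sweep thresholds over the positive values (high to low) and trace the ROC.
    thresholds = np.sort(np.unique(pos))[::-1]
    tpr = [0.0] + [float(np.mean(pos >= t)) for t in thresholds] + [1.0]
    fpr = [0.0] + [float(np.mean(neg >= t)) for t in thresholds] + [1.0]
    # Trapezoidal integration of the ROC curve.
    auc = 0.0
    for i in range(1, len(fpr)):
        auc += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return auc
```

Other variants (e.g., shuffled AUC) instead draw negatives from fixations on other images, which penalizes maps that merely reproduce the center bias; this is one source of the cross-benchmark inconsistencies noted above.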
Q7. What is the reason for the lack of models to predict scanpath sequence?
In the context of saliency modeling, few models have aimed to predict scanpath sequences, partly due to the difficulty of measuring and quantizing scanpaths.
Q8. How do the authors make a fixation histogram?
The fixation histogram is made by dividing the image into a 16 × 16 grid and counting the number of fixations that fall in each cell.
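The grid-counting procedure above can be sketched as follows; the function name `fixation_histogram` and the (x, y) pixel-coordinate convention are assumptions for illustration.

```python
import numpy as np

def fixation_histogram(fixations, height, width, grid=16):
    """Count fixations per cell of a grid x grid partition of the image.

    fixations: iterable of (x, y) pixel coordinates (assumed convention).
    Returns a (grid, grid) integer array of per-cell fixation counts.
    """
    hist = np.zeros((grid, grid), dtype=int)
    for x, y in fixations:
        # Map pixel coordinates to cell indices, clamping to the last cell
        # so fixations on the right/bottom borders stay in range.
        col = min(int(x * grid / width), grid - 1)
        row = min(int(y * grid / height), grid - 1)
        hist[row, col] += 1
    return hist
```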
Q9. What is the way to compute the histograms for a given image?
To compute the histograms for a given image, the authors initially compute corresponding features (e.g., saccade velocity, etc.) for each observer and quantize the values into several bins.
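A minimal sketch of this per-observer quantization step is shown below; the function name, the default bin count, and returning a normalized histogram are assumptions for illustration, not specifics from the paper.

```python
import numpy as np

def feature_histogram(values, n_bins=10, value_range=None):
    """Quantize per-observer feature values (e.g., saccade velocities)
    into n_bins equal-width bins and return a normalized histogram.
    """
    values = np.asarray(values, dtype=float)
    if value_range is None:
        # Default to the observed range of this observer's values.
        value_range = (values.min(), values.max())
    hist, _ = np.histogram(values, bins=n_bins, range=value_range)
    return hist / hist.sum()
```

Fixing `value_range` across observers (rather than using each observer's own range) would make the resulting histograms directly comparable.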
Q10. Why do the authors believe it is important to measure the gap between the IO model and models?
The authors believe it is important to constantly measure the gap between the IO model and models to find out in which directions models lag behind human performance.
Q11. What is the way to tune the parameters in a model?
Properly tuning these parameters is important for fair model comparison and is perhaps best left to the model developers to optimize themselves.