Saliency and Human Fixations: State-of-the-Art and Study of Comparison Metrics

doi:10.1109/ICCV.2013.147

Proceedings Article•DOI•

Saliency and Human Fixations: State-of-the-Art and Study of Comparison Metrics

Nicolas Riche¹, Matthieu Duvinage¹, Matei Mancas¹, Bernard Gosselin¹, Thierry Dutoit¹•Institutions (1)

University of Mons¹

01 Dec 2013-pp 1153-1160

TL;DR: This paper compares the ranking of 12 state-of-the-art saliency models using 12 similarity metrics and shows that some of the metrics are strongly correlated, leading to redundancy in the performance metrics reported in the available benchmarks.

Abstract: Visual saliency has been an increasingly active research area in the last ten years, with dozens of saliency models recently published. Nowadays, one of the big challenges in the field is to find a way to fairly evaluate all of these models. In this paper, we compare the ranking of 12 state-of-the-art saliency models on human eye fixations using 12 similarity metrics. The comparison is done on Jian Li's database, which contains several hundred natural images. Based on Kendall's concordance coefficient, it is shown that some of the metrics are strongly correlated, leading to redundancy in the performance metrics reported in the available benchmarks. On the other hand, other metrics provide a more diverse picture of the models' overall performance. As a recommendation, three similarity metrics should be used to obtain a complete view of saliency model performance.
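
The concordance analysis described in the abstract can be illustrated with a small sketch. It assumes the standard textbook form of Kendall's W without tie correction, W = 12 S / (m² (n³ − n)), where m rankings (here, metrics) order the same n items (here, models) and S is the sum of squared deviations of the per-item rank sums from their mean. The function names and the score values are hypothetical and are not taken from the paper.

```python
def rank_descending(scores):
    """Return 1-based ranks (1 = best) for a list of scores; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def kendalls_w(rank_rows):
    """Kendall's W for m rankings (rows) of the same n items, no tie correction."""
    m = len(rank_rows)
    n = len(rank_rows[0])
    # Sum of the ranks each item received across all rankings.
    rank_sums = [sum(row[j] for row in rank_rows) for j in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((r - mean_rank_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical scores: 3 metrics (rows) scoring 4 models (columns), higher is better.
scores_by_metric = [
    [0.80, 0.70, 0.60, 0.50],
    [0.75, 0.72, 0.55, 0.40],
    [0.60, 0.90, 0.50, 0.30],
]
ranks = [rank_descending(row) for row in scores_by_metric]
w = kendalls_w(ranks)
```

A value of W close to 1 means the metrics rank the models almost identically (and are therefore largely redundant), while a value close to 0 means they disagree.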

Citations

Journal Article•DOI•

Deep Visual Attention Prediction

Wenguan Wang¹, Jianbing Shen¹•Institutions (1)

Beijing Institute of Technology¹

01 May 2018-IEEE Transactions on Image Processing

TL;DR: Wang et al. propose a skip-layer network structure that predicts human attention from multiple convolutional layers with various receptive fields, which significantly decreases the redundancy of previous approaches that learn multiple network streams with different input scales.

Abstract: In this paper, we aim to predict human eye fixation with view-free scenes based on an end-to-end deep learning architecture. Although convolutional neural networks (CNNs) have made substantial improvement on human attention prediction, it is still needed to improve the CNN-based attention models by efficiently leveraging multi-scale features. Our visual attention network is proposed to capture hierarchical saliency information from deep, coarse layers with global saliency information to shallow, fine layers with local saliency response. Our model is based on a skip-layer network structure, which predicts human attention from multiple convolutional layers with various receptive fields. Final saliency prediction is achieved via the cooperation of those global and local predictions. Our model is learned in a deep supervision manner, where supervision is directly fed into multi-level layers, instead of previous approaches of providing supervision only at the output layer and propagating this supervision back to earlier layers. Our model thus incorporates multi-level saliency predictions within a single network, which significantly decreases the redundancy of previous approaches of learning multiple network streams with different input scales. Extensive experimental analysis on various challenging benchmark data sets demonstrates that our method yields state-of-the-art performance with competitive inference time. Our source code is available at https://github.com/wenguanwang/deepattention.

532 citations

Journal Article•DOI•

What Do Different Evaluation Metrics Tell Us About Saliency Models

Zoya Bylinskii¹, Tilke Judd², Aude Oliva¹, Antonio Torralba¹, Frédo Durand¹•Institutions (2)

Massachusetts Institute of Technology¹, Google²

01 Mar 2019-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This paper provides an analysis of 8 different evaluation metrics and their properties, and makes recommendations for metric selections under specific assumptions and for specific applications.

Abstract: How best to evaluate a saliency model's ability to predict where humans look in images is an open research question. The choice of evaluation metric depends on how saliency is defined and how the ground truth is represented. Metrics differ in how they rank saliency models, and this results from how false positives and false negatives are treated, whether viewing biases are accounted for, whether spatial deviations are factored in, and how the saliency maps are pre-processed. In this paper, we provide an analysis of 8 different evaluation metrics and their properties. With the help of systematic experiments and visualizations of metric computations, we add interpretability to saliency scores and more transparency to the evaluation of saliency models. Building off the differences in metric properties and behaviors, we make recommendations for metric selections under specific assumptions and for specific applications.

526 citations

Cites background or methods from "Saliency and Human Fixations: State..."

...The inherent ambiguity in how saliency and ground truth are represented leads to different choices of metrics for reporting performance [9], [14], [53], [64], [86]....
[...]
...[64] provided an evaluation of 12 saliency models with 12 similarity metrics on Jian Li’s dataset [44]....
[...]
...Metric | Denoted here | Evaluation papers appearing in
Area under ROC Curve | AUC | [9], [21], [22], [45], [53], [64], [86], [91]
Shuffled AUC | sAUC | [8], [9], [45], [64]
Normalized Scanpath Saliency | NSS | [8], [9], [21], [45], [53], [64], [86], [91]
Pearson’s Correlation Coefficient | CC | [8], [9], [21], [22], [45], [64], [86]
Earth Mover’s Distance | EMD | [45], [64], [91]
Similarity or histogram intersection | SIM | [45], [64]...
[...]
...Kullback-Leibler divergence | KL | [21], [45], [64], [86]...
[...]
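
Four of the metrics named in the excerpt above (NSS, CC, SIM, KL) can be sketched in a few lines. This is a minimal illustration under commonly used definitions, not the evaluation code of either paper: maps are assumed to be flattened into equal-length lists, `fixation_mask` holds 0/1 fixation locations, `fixation_density` is a blurred fixation map, and all function names and the `eps` regularizer are assumptions made for this example.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _std(values):
    mu = _mean(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))


def _to_distribution(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def nss(saliency, fixation_mask):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels."""
    mu, sigma = _mean(saliency), _std(saliency)
    hits = [(s - mu) / sigma for s, f in zip(saliency, fixation_mask) if f]
    return _mean(hits)


def cc(saliency, fixation_density):
    """Pearson's correlation coefficient between the two maps."""
    mu_s, mu_f = _mean(saliency), _mean(fixation_density)
    cov = _mean([(s - mu_s) * (f - mu_f) for s, f in zip(saliency, fixation_density)])
    return cov / (_std(saliency) * _std(fixation_density))


def sim(saliency, fixation_density):
    """Similarity (histogram intersection) of the two maps as distributions."""
    p, q = _to_distribution(saliency), _to_distribution(fixation_density)
    return sum(min(a, b) for a, b in zip(p, q))


def kl_div(saliency, fixation_density, eps=1e-12):
    """KL divergence of the prediction from the ground-truth distribution."""
    p, q = _to_distribution(saliency), _to_distribution(fixation_density)
    return sum(q_i * math.log(eps + q_i / (p_i + eps)) for p_i, q_i in zip(p, q))


# Hypothetical flattened 2x2 maps.
saliency = [0.1, 0.2, 0.3, 0.4]
density = [0.1, 0.2, 0.3, 0.4]
mask = [0, 0, 0, 1]
```

With identical prediction and ground truth, CC and SIM reach their maximum of 1 and KL its minimum of 0; NSS is positive because the single fixation falls on the most salient pixel.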

Journal Article•DOI•

Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model.

Marcella Cornia¹, Lorenzo Baraldi¹, Giuseppe Serra², Rita Cucchiara¹•Institutions (2)

University of Modena and Reggio Emilia¹, University of Udine²

29 Jun 2018-IEEE Transactions on Image Processing

TL;DR: Cornia et al. propose a convolutional long short-term memory (LSTM) network that iteratively refines the predicted saliency map by focusing on the most salient regions of the input image.

Abstract: Data-driven saliency has recently gained a lot of attention thanks to the use of convolutional neural networks for predicting gaze fixations. In this paper, we go beyond standard approaches to saliency prediction, in which gaze maps are computed with a feed-forward network, and present a novel model which can predict accurate saliency maps by incorporating neural attentive mechanisms. The core of our solution is a convolutional long short-term memory that focuses on the most salient regions of the input image to iteratively refine the predicted saliency map. In addition, to tackle the center bias typical of human eye fixations, our model can learn a set of prior maps generated with Gaussian functions. We show, through an extensive evaluation, that the proposed architecture outperforms the current state-of-the-art on public saliency prediction datasets. We further study the contribution of each key component to demonstrate their robustness on different scenarios.
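
The idea of a prior map generated with a Gaussian function can be sketched as follows. This is only an illustration: the abstract states that the model learns its prior maps, whereas the means and standard deviations below are fixed, hypothetical values, and the function name and normalized-coordinate convention are assumptions made for this example.

```python
import math


def gaussian_prior(height, width, mu_y=0.5, mu_x=0.5, sigma_y=0.25, sigma_x=0.25):
    """Axis-aligned 2D Gaussian on normalized [0, 1] coordinates, with peak value 1."""
    prior = []
    for row in range(height):
        y = (row + 0.5) / height  # pixel center in normalized coordinates
        line = []
        for col in range(width):
            x = (col + 0.5) / width
            exponent = ((y - mu_y) ** 2) / (2 * sigma_y ** 2) \
                + ((x - mu_x) ** 2) / (2 * sigma_x ** 2)
            line.append(math.exp(-exponent))
        prior.append(line)
    return prior


# Hypothetical 5x5 center-biased prior: largest at the image center, decaying outward.
center_prior = gaussian_prior(5, 5)
```

In a trainable model, the four Gaussian parameters would be optimized together with the network weights, and several such maps could be combined with the predicted saliency features.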

503 citations

Journal Article•DOI•

DeepFix: A Fully Convolutional Neural Network for Predicting Human Eye Fixations

Srinivas S S Kruthiventi¹, Kumar Ayush², R. Venkatesh Babu¹•Institutions (2)

Indian Institute of Science¹, Indian Institute of Technology Kharagpur²

01 Jun 2017-IEEE Transactions on Image Processing

TL;DR: This paper proposes DeepFix, a fully convolutional neural network (FCN) that models the bottom-up mechanism of visual attention via saliency prediction and predicts the saliency map in an end-to-end manner.

Abstract: Understanding and predicting the human visual attention mechanism is an active area of research in the fields of neuroscience and computer vision. In this paper, we propose DeepFix, a fully convolutional neural network, which models the bottom–up mechanism of visual attention via saliency prediction. Unlike classical works, which characterize the saliency map using various hand-crafted features, our model automatically learns features in a hierarchical fashion and predicts the saliency map in an end-to-end manner. DeepFix is designed to capture semantics at multiple scales while taking global context into account, by using network layers with very large receptive fields. Generally, fully convolutional nets are spatially invariant—this prevents them from modeling location-dependent patterns (e.g., centre-bias). Our network handles this by incorporating a novel location-biased convolutional layer. We evaluate our model on multiple challenging saliency data sets and show that it achieves the state-of-the-art results.

443 citations

Journal Article•

SalGAN: visual saliency prediction with generative adversarial networks

Junting Pan, Cristian Canton-Ferrer, Kevin McGuinness, Noel E. O'Connor, Jordi Torres, Elisa Sayrol, Xavier Giro-i-Nieto

04 Jan 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work introduces SalGAN, a deep convolutional neural network for visual saliency prediction trained with adversarial examples and shows how adversarial training allows reaching state-of-the-art performance across different metrics when combined with a widely-used loss function like BCE.

Abstract: We introduce SalGAN, a deep convolutional neural network for visual saliency prediction trained with adversarial examples. The first stage of the network consists of a generator model whose weights are learned by back-propagation computed from a binary cross entropy (BCE) loss over downsampled versions of the saliency maps. The resulting prediction is processed by a discriminator network trained to solve a binary classification task between the saliency maps generated by the generative stage and the ground truth ones. Our experiments show how adversarial training allows reaching state-of-the-art performance across different metrics when combined with a widely-used loss function like BCE. Our results can be reproduced with the source code and trained models available at https://imatge-upc.github.io/saliency-salgan-2017/.
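
A BCE loss over downsampled saliency maps, as mentioned in the abstract, might look roughly like the sketch below. The abstract does not say how the maps are downsampled, so the 2x average pooling is an assumption, as are the function names, the clamping constant `eps`, and the toy maps.

```python
import math


def downsample_2x(image):
    """Average-pool a 2D list by a factor of 2 (assumes even height and width)."""
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, len(image[0]), 2)
        ]
        for r in range(0, len(image), 2)
    ]


def bce_loss(predicted, target, eps=1e-7):
    """Mean per-pixel binary cross entropy between two maps with values in [0, 1]."""
    total, count = 0.0, 0
    for p_row, t_row in zip(predicted, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count


# Hypothetical 2x2 maps: an uninformative prediction against a fully salient target.
predicted_map = [[0.5, 0.5], [0.5, 0.5]]
target_map = [[1.0, 1.0], [1.0, 1.0]]
loss = bce_loss(downsample_2x(predicted_map), downsample_2x(target_map))
```

In the adversarial setting described above, a term of this kind would be combined with the discriminator's feedback when updating the generator.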

339 citations

Cites background from "Saliency and Human Fixations: State..."

...Finally, Section 6 closes the paper by drawing the main conclusions....
[...]


References

Journal Article•DOI•

A model of saliency-based visual attention for rapid scene analysis

Laurent Itti¹, Christof Koch¹, Ernst Niebur²•Institutions (2)

California Institute of Technology¹, Johns Hopkins University²

01 Nov 1998-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: In this article, a visual attention system inspired by the behavior and the neuronal architecture of the early primate visual system is presented, where multiscale image features are combined into a single topographical saliency map.

Abstract: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented. Multiscale image features are combined into a single topographical saliency map. A dynamical neural network then selects attended locations in order of decreasing saliency. The system breaks down the complex problem of scene understanding by rapidly selecting, in a computationally efficient manner, conspicuous locations to be analyzed in detail.

10,525 citations

A model of saliency-based visual attention for rapid scene analysis

Laurent Itti

01 Jan 1998

TL;DR: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented, which breaks down the complex problem of scene understanding by rapidly selecting conspicuous locations to be analyzed in detail.

8,566 citations

Additional excerpts

...Itti’s model [12] represents the cognitive approach....
[...]

Book•

Statistical Methods for Psychology

David C. Howell

01 Jan 1982

TL;DR: Statistical Methods for Psychology surveys statistical techniques commonly used in the behavioral and social sciences, especially psychology and education, and is suitable for either a one-term or a full-year course.

Abstract: This seventh edition of Statistical Methods for Psychology, like the previous editions, surveys statistical techniques commonly used in the behavioral and social sciences, especially psychology and education. Although it is designed for advanced undergraduates and graduate students, it does not assume that students have had either a previous course in statistics or a course in mathematics beyond high-school algebra. Those students who have had an introductory course will find that the early material provides a welcome review. The book is suitable for either a one-term or a full-year course, and I have used it successfully for both. Since I have found that students, and faculty, frequently refer back to the book from which they originally learned statistics when they have a statistical problem, I have included material that will make the book a useful reference for future use. The instructor who wishes to omit this material will have no difficulty doing so. I have cut back on that material, however, to include only what is still likely to be useful. The idea of including every interesting idea had led to a book that was beginning to be daunting.

7,579 citations

"Saliency and Human Fixations: State..." refers background or methods in this paper

...Furthermore, some rules of thumb are provided [11] to allow the researcher to interpret this measure as depicted in Tab....
[...]
...To compare models rank according to the different metrics, Kendall’s W concordance measure [11] is used (as defined as Eq....
[...]

Book Chapter•DOI•

Shifts in selective visual attention: towards the underlying neural circuitry.

Christof Koch¹, Shimon Ullman²•Institutions (2)

California Institute of Technology¹, Massachusetts Institute of Technology²

01 Jan 1985-Human neurobiology

TL;DR: This study addresses the question of how simple networks of neuron-like elements can account for a variety of phenomena associated with this shift of selective visual attention and suggests a possible role for the extensive back-projection from the visual cortex to the LGN.

Abstract: Psychophysical and physiological evidence indicates that the visual system of primates and humans has evolved a specialized processing focus moving across the visual scene. This study addresses the question of how simple networks of neuron-like elements can account for a variety of phenomena associated with this shift of selective visual attention. Specifically, we propose the following: (1) A number of elementary features, such as color, orientation, direction of movement, disparity etc. are represented in parallel in different topographical maps, called the early representation. (2) There exists a selective mapping from the early topographic representation into a more central non-topographic representation, such that at any instant the central representation contains the properties of only a single location in the visual scene, the selected location. We suggest that this mapping is the principal expression of early selective visual attention. One function of selective attention is to fuse information from different maps into one coherent whole. (3) Certain selection rules determine which locations will be mapped into the central representation. The major rule, using the conspicuity of locations in the early representation, is implemented using a so-called Winner-Take-All network. Inhibiting the selected location in this network causes an automatic shift towards the next most conspicuous location. Additional rules are proximity and similarity preferences. We discuss how these rules can be implemented in neuron-like networks and suggest a possible role for the extensive back-projection from the visual cortex to the LGN.
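
The major selection rule in (3), winner-take-all followed by inhibition of the selected location, can be sketched procedurally. This is a simplified illustration, not the neuron-like network the authors discuss: it uses a plain maximum search, suppresses a square neighborhood whose radius is an assumed parameter, and ignores the proximity and similarity preferences. The function name and the example map are hypothetical.

```python
def attend_sequence(saliency, num_shifts, inhibition_radius=1):
    """Return the locations attended in order of decreasing conspicuity."""
    grid = [row[:] for row in saliency]  # work on a copy of the map
    visited = []
    for _ in range(num_shifts):
        # Winner-take-all: select the most conspicuous remaining location.
        _, best_row, best_col = max(
            (value, r, c) for r, row in enumerate(grid) for c, value in enumerate(row)
        )
        visited.append((best_row, best_col))
        # Inhibit the winner and its neighborhood so attention shifts elsewhere.
        for r in range(max(0, best_row - inhibition_radius),
                       min(len(grid), best_row + inhibition_radius + 1)):
            for c in range(max(0, best_col - inhibition_radius),
                           min(len(grid[0]), best_col + inhibition_radius + 1)):
                grid[r][c] = float("-inf")
    return visited


# Hypothetical 4x4 conspicuity map.
conspicuity = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.2, 0.6],
    [0.0, 0.1, 0.5, 0.4],
]
shifts = attend_sequence(conspicuity, 3)
```

Each iteration corresponds to one shift of the processing focus: once the current winner is inhibited, the next most conspicuous uninhibited location wins.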

3,930 citations

"Saliency and Human Fixations: State..." refers background in this paper

...In the field of computer vision, a wide variety of models that aim at mimicking the visual attention cognitive process exists [15] [30]....
[...]

Proceedings Article•DOI•

Frequency-tuned salient region detection

Radhakrishna Achanta¹, Sheila S. Hemami², Francisco J. Estrada¹, Sabine Süsstrunk¹•Institutions (2)

École Polytechnique Fédérale de Lausanne¹, Cornell University²

20 Jun 2009

TL;DR: This paper introduces a method for salient region detection that outputs full resolution saliency maps with well-defined boundaries of salient objects that outperforms the five algorithms both on the ground-truth evaluation and on the segmentation task by achieving both higher precision and better recall.

Abstract: Detection of visually salient image regions is useful for applications like object segmentation, adaptive compression, and object recognition. In this paper, we introduce a method for salient region detection that outputs full resolution saliency maps with well-defined boundaries of salient objects. These boundaries are preserved by retaining substantially more frequency content from the original image than other existing techniques. Our method exploits features of color and luminance, is simple to implement, and is computationally efficient. We compare our algorithm to five state-of-the-art salient region detection methods with a frequency domain analysis, ground truth, and a salient object segmentation application. Our method outperforms the five algorithms both on the ground-truth evaluation and on the segmentation task by achieving both higher precision and better recall.

3,723 citations