Proceedings ArticleDOI

Saliency and Human Fixations: State-of-the-Art and Study of Comparison Metrics

01 Dec 2013-pp 1153-1160
TL;DR: This paper compares the ranking of 12 state-of-the-art saliency models using 12 similarity metrics and shows that some of the metrics are strongly correlated, leading to redundancy in the performance metrics reported in the available benchmarks.
Abstract: Visual saliency has been an increasingly active research area over the last ten years, with dozens of saliency models recently published. One of the big challenges in the field today is to find a way to evaluate all of these models fairly. In this paper, we compare the ranking of 12 state-of-the-art saliency models against human eye fixations using 12 similarity metrics. The comparison is done on Jian Li's database, which contains several hundred natural images. Based on Kendall's concordance coefficient, it is shown that some of the metrics are strongly correlated, leading to redundancy in the performance metrics reported in the available benchmarks. On the other hand, other metrics provide a more diverse picture of models' overall performance. As a recommendation, three similarity metrics should be used to obtain a complete picture of saliency model performance.
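The abstract relies on Kendall's concordance coefficient to measure how closely different metrics agree on the ranking of models. The sketch below shows one way this could be computed with the textbook formula W = 12 S / (m^2 (n^3 - n)), where m is the number of metrics and n the number of models. It assumes no tied ranks and that higher scores are better; the function names and the example scores are hypothetical and are not taken from the paper.

```python
def ranks_from_scores(scores):
    """Turn one metric's scores into ranks (1 = best). Assumes higher is better and no ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks


def kendall_w(rankings):
    """Kendall's W for m rankings of the same n items (no tie correction).

    rankings: list of m lists, each holding the ranks 1..n assigned by one metric.
    Returns a value in [0, 1]; 1 means all metrics rank the models identically.
    """
    m = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((x - mean_sum) ** 2 for x in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical scores for three models under two metrics.
metric_a = ranks_from_scores([0.7, 0.9, 0.5])
metric_b = ranks_from_scores([0.6, 0.8, 0.4])
agreement = kendall_w([metric_a, metric_b])
```

A value of W close to 1 for a group of metrics would indicate that they order the models almost identically, which is the kind of redundancy the abstract describes.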


Citations
Journal ArticleDOI
TL;DR: Wang et al. propose a skip-layer network structure that predicts human attention from multiple convolutional layers with various receptive fields, which significantly decreases the redundancy of previous approaches that learn multiple network streams with different input scales.
Abstract: In this paper, we aim to predict human eye fixation with view-free scenes based on an end-to-end deep learning architecture. Although convolutional neural networks (CNNs) have made substantial improvement on human attention prediction, CNN-based attention models still need to be improved by efficiently leveraging multi-scale features. Our visual attention network is proposed to capture hierarchical saliency information from deep, coarse layers with global saliency information to shallow, fine layers with local saliency response. Our model is based on a skip-layer network structure, which predicts human attention from multiple convolutional layers with various receptive fields. Final saliency prediction is achieved via the cooperation of those global and local predictions. Our model is learned in a deep supervision manner, where supervision is directly fed into multi-level layers, instead of previous approaches of providing supervision only at the output layer and propagating this supervision back to earlier layers. Our model thus incorporates multi-level saliency predictions within a single network, which significantly decreases the redundancy of previous approaches of learning multiple network streams with different input scales. Extensive experimental analysis on various challenging benchmark data sets demonstrates that our method yields state-of-the-art performance with competitive inference time. Our source code is available at https://github.com/wenguanwang/deepattention .

532 citations

Journal ArticleDOI
TL;DR: This paper provides an analysis of 8 different evaluation metrics and their properties, and makes recommendations for metric selections under specific assumptions and for specific applications.
Abstract: How best to evaluate a saliency model's ability to predict where humans look in images is an open research question. The choice of evaluation metric depends on how saliency is defined and how the ground truth is represented. Metrics differ in how they rank saliency models, and this results from how false positives and false negatives are treated, whether viewing biases are accounted for, whether spatial deviations are factored in, and how the saliency maps are pre-processed. In this paper, we provide an analysis of 8 different evaluation metrics and their properties. With the help of systematic experiments and visualizations of metric computations, we add interpretability to saliency scores and more transparency to the evaluation of saliency models. Building off the differences in metric properties and behaviors, we make recommendations for metric selections under specific assumptions and for specific applications.

526 citations


Cites background or methods from "Saliency and Human Fixations: State..."

  • ...The inherent ambiguity in how saliency and ground truth are represented leads to different choices of metrics for reporting performance [9], [14], [53], [64], [86]....

    [...]

  • ...[64] provided an evaluation of 12 saliency models with 12 similarity metrics on Jian Li’s dataset [44]....

    [...]

  • ...Metric | Denoted here | Evaluation papers appearing in
    Area under ROC Curve | AUC | [9], [21], [22], [45], [53], [64], [86], [91]
    Shuffled AUC | sAUC | [8], [9], [45], [64]
    Normalized Scanpath Saliency | NSS | [8], [9], [21], [45], [53], [64], [86], [91]
    Pearson’s Correlation Coefficient | CC | [8], [9], [21], [22], [45], [64], [86]
    Earth Mover’s Distance | EMD | [45], [64], [91]
    Similarity or histogram intersection | SIM | [45], [64]...

    [...]

  • ...Kullback-Leibler divergence KL [21], [45], [64], [86]...

    [...]
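Three of the metrics named in the excerpts above (NSS, CC and SIM) have compact textbook definitions, sketched below on flattened maps. This is a minimal illustration under stated assumptions, not the evaluation code of any cited paper: maps are plain lists of non-negative numbers, the saliency map is not constant, and the function names and toy inputs are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _std(values):
    mu = _mean(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))


def nss(saliency, fixations):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels.

    fixations is a binary list (1 = fixated) of the same length as saliency.
    """
    mu, sigma = _mean(saliency), _std(saliency)
    hits = [(s - mu) / sigma for s, f in zip(saliency, fixations) if f]
    return _mean(hits)


def cc(saliency, density):
    """Pearson's correlation coefficient between a saliency map and a fixation density map."""
    mu_s, mu_d = _mean(saliency), _mean(density)
    cov = _mean([(s - mu_s) * (d - mu_d) for s, d in zip(saliency, density)])
    return cov / (_std(saliency) * _std(density))


def sim(saliency, density):
    """Similarity (histogram intersection) of the two maps after each is scaled to sum to 1."""
    total_s, total_d = sum(saliency), sum(density)
    return sum(min(s / total_s, d / total_d) for s, d in zip(saliency, density))


# Toy inputs: a four-pixel saliency map with one fixation on the brightest pixel.
score = nss([0.0, 0.0, 0.0, 1.0], [0, 0, 0, 1])
```

NSS is computed against discrete fixation locations, while CC and SIM compare two continuous maps, which is one reason the metrics can rank the same models differently.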

Journal ArticleDOI
TL;DR: This paper proposes a convolutional long short-term memory (LSTM) network that iteratively refines the predicted saliency map by focusing on the most salient regions of the input image.
Abstract: Data-driven saliency has recently gained a lot of attention thanks to the use of convolutional neural networks for predicting gaze fixations. In this paper, we go beyond standard approaches to saliency prediction, in which gaze maps are computed with a feed-forward network, and present a novel model which can predict accurate saliency maps by incorporating neural attentive mechanisms. The core of our solution is a convolutional long short-term memory that focuses on the most salient regions of the input image to iteratively refine the predicted saliency map. In addition, to tackle the center bias typical of human eye fixations, our model can learn a set of prior maps generated with Gaussian functions. We show, through an extensive evaluation, that the proposed architecture outperforms the current state-of-the-art on public saliency prediction datasets. We further study the contribution of each key component to demonstrate their robustness on different scenarios.
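The abstract mentions prior maps generated with Gaussian functions to model center bias. The sketch below shows what a single such map could look like when built from a 2D Gaussian with an axis-aligned covariance. The parameterization (normalized coordinates, separate means and standard deviations per axis) and the function name are assumptions made for illustration; in the model described above these parameters would be learned, and the paper's exact formulation is not given here.

```python
import math


def gaussian_prior(height, width, mu_x, mu_y, sigma_x, sigma_y):
    """Build a height x width prior map from an axis-aligned 2D Gaussian.

    Pixel coordinates are normalized to [0, 1] (assumes height > 1 and width > 1).
    The peak value is 1.0 at (mu_x, mu_y) when that point falls on a pixel.
    """
    prior = []
    for r in range(height):
        y = r / (height - 1)
        row = []
        for c in range(width):
            x = c / (width - 1)
            exponent = (x - mu_x) ** 2 / (2 * sigma_x ** 2) + (y - mu_y) ** 2 / (2 * sigma_y ** 2)
            row.append(math.exp(-exponent))
        prior.append(row)
    return prior


# A centered prior: largest in the middle of the image, decaying toward the borders.
center_prior = gaussian_prior(5, 5, mu_x=0.5, mu_y=0.5, sigma_x=0.2, sigma_y=0.2)
```

A set of such maps with different means and spreads could then be combined with the network's features, so that the model can favor image regions where fixations tend to concentrate.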

503 citations

Journal ArticleDOI
TL;DR: DeepFix is a fully convolutional neural network (FCN) that models the bottom-up mechanism of visual attention via saliency prediction and predicts the saliency map in an end-to-end manner.
Abstract: Understanding and predicting the human visual attention mechanism is an active area of research in the fields of neuroscience and computer vision. In this paper, we propose DeepFix, a fully convolutional neural network, which models the bottom–up mechanism of visual attention via saliency prediction. Unlike classical works, which characterize the saliency map using various hand-crafted features, our model automatically learns features in a hierarchical fashion and predicts the saliency map in an end-to-end manner. DeepFix is designed to capture semantics at multiple scales while taking global context into account, by using network layers with very large receptive fields. Generally, fully convolutional nets are spatially invariant—this prevents them from modeling location-dependent patterns (e.g., centre-bias). Our network handles this by incorporating a novel location-biased convolutional layer. We evaluate our model on multiple challenging saliency data sets and show that it achieves the state-of-the-art results.

443 citations

Journal Article
TL;DR: This work introduces SalGAN, a deep convolutional neural network for visual saliency prediction trained with adversarial examples and shows how adversarial training allows reaching state-of-the-art performance across different metrics when combined with a widely-used loss function like BCE.
Abstract: We introduce SalGAN, a deep convolutional neural network for visual saliency prediction trained with adversarial examples. The first stage of the network consists of a generator model whose weights are learned by back-propagation computed from a binary cross entropy (BCE) loss over downsampled versions of the saliency maps. The resulting prediction is processed by a discriminator network trained to solve a binary classification task between the saliency maps generated by the generative stage and the ground truth ones. Our experiments show how adversarial training allows reaching state-of-the-art performance across different metrics when combined with a widely-used loss function like BCE. Our results can be reproduced with the source code and trained models available at https://imatge-upc.github.io/saliency-salgan-2017/.
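The content loss described in the abstract, a binary cross entropy over downsampled saliency maps, can be illustrated as follows. This is a rough sketch and not the SalGAN implementation: average pooling is assumed as the downsampling method, map dimensions are assumed to be divisible by the pooling factor, and the function names are hypothetical.

```python
import math


def downsample(saliency_map, factor):
    """Average-pool a 2D map over non-overlapping factor x factor blocks."""
    height, width = len(saliency_map), len(saliency_map[0])
    pooled = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [saliency_map[i][j]
                     for i in range(r, r + factor)
                     for j in range(c, c + factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def bce(predicted, target, eps=1e-7):
    """Mean per-pixel binary cross entropy between two maps with values in [0, 1]."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(predicted, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count


# Toy example: compare a prediction and a ground-truth map at half resolution.
prediction = [[0.1, 0.1, 0.8, 0.8], [0.1, 0.1, 0.8, 0.8]]
ground_truth = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
loss = bce(downsample(prediction, 2), downsample(ground_truth, 2))
```

In the adversarial setup the abstract describes, a term like this would be combined with the discriminator's feedback rather than used alone.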

339 citations


Cites background from "Saliency and Human Fixations: State..."

  • ...Finally, Section 6 closes the paper by drawing the main conclusions....

    [...]

References
Journal ArticleDOI
TL;DR: In this article, a visual attention system inspired by the behavior and the neuronal architecture of the early primate visual system is presented, where multiscale image features are combined into a single topographical saliency map.
Abstract: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented. Multiscale image features are combined into a single topographical saliency map. A dynamical neural network then selects attended locations in order of decreasing saliency. The system breaks down the complex problem of scene understanding by rapidly selecting, in a computationally efficient manner, conspicuous locations to be analyzed in detail.

10,525 citations

01 Jan 1998
TL;DR: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented, which breaks down the complex problem of scene understanding by rapidly selecting conspicuous locations to be analyzed in detail.

8,566 citations


Additional excerpts

  • ...Itti’s model [12] represents the cognitive approach....

    [...]

Book
01 Jan 1982
TL;DR: Statistical Methods for Psychology surveys statistical techniques commonly used in the behavioral and social sciences, especially psychology and education; it is suitable for either a one-term or a full-year course and has been used successfully for both.
Abstract: This seventh edition of Statistical Methods for Psychology, like the previous editions, surveys statistical techniques commonly used in the behavioral and social sciences, especially psychology and education. Although it is designed for advanced undergraduates and graduate students, it does not assume that students have had either a previous course in statistics or a course in mathematics beyond high-school algebra. Those students who have had an introductory course will find that the early material provides a welcome review. The book is suitable for either a one-term or a full-year course, and I have used it successfully for both. Since I have found that students, and faculty, frequently refer back to the book from which they originally learned statistics when they have a statistical problem, I have included material that will make the book a useful reference for future use. The instructor who wishes to omit this material will have no difficulty doing so. I have cut back on that material, however, to include only what is still likely to be useful. The idea of including every interesting idea had led to a book that was beginning to be daunting.

7,579 citations


"Saliency and Human Fixations: State..." refers background or methods in this paper

  • ...Furthermore, some rules of thumb are provided [11] to allow the researcher to interpret this measure as depicted in Tab....

    [...]

  • ...To compare model rankings across the different metrics, Kendall’s W concordance measure [11] is used (as defined in Eq....

    [...]

Book ChapterDOI
TL;DR: This study addresses the question of how simple networks of neuron-like elements can account for a variety of phenomena associated with this shift of selective visual attention and suggests a possible role for the extensive back-projection from the visual cortex to the LGN.
Abstract: Psychophysical and physiological evidence indicates that the visual system of primates and humans has evolved a specialized processing focus moving across the visual scene. This study addresses the question of how simple networks of neuron-like elements can account for a variety of phenomena associated with this shift of selective visual attention. Specifically, we propose the following: (1) A number of elementary features, such as color, orientation, direction of movement, disparity etc. are represented in parallel in different topographical maps, called the early representation. (2) There exists a selective mapping from the early topographic representation into a more central non-topographic representation, such that at any instant the central representation contains the properties of only a single location in the visual scene, the selected location. We suggest that this mapping is the principal expression of early selective visual attention. One function of selective attention is to fuse information from different maps into one coherent whole. (3) Certain selection rules determine which locations will be mapped into the central representation. The major rule, using the conspicuity of locations in the early representation, is implemented using a so-called Winner-Take-All network. Inhibiting the selected location in this network causes an automatic shift towards the next most conspicuous location. Additional rules are proximity and similarity preferences. We discuss how these rules can be implemented in neuron-like networks and suggest a possible role for the extensive back-projection from the visual cortex to the LGN.
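The selection rule in point (3), a winner-take-all choice followed by inhibition of the selected location, can be sketched procedurally as below. This is a simplified illustration, not the neuron-like network the abstract proposes: the maximum is found by direct search, inhibition is modeled by suppressing a square neighborhood, and the function name, the `radius` parameter and the toy map are hypothetical. The proximity and similarity preferences are not modeled.

```python
def attend_sequence(saliency, n_shifts, radius=1):
    """Return up to n_shifts attended (row, col) locations in decreasing conspicuity.

    After each selection, a (2 * radius + 1)-wide square around the winner is
    suppressed so that attention shifts to the next most conspicuous location.
    """
    sal = [row[:] for row in saliency]  # work on a copy; the input is left untouched
    height, width = len(sal), len(sal[0])
    attended = []
    for _ in range(n_shifts):
        # Winner-take-all: the location with the highest remaining value wins.
        best = max(
            ((r, c) for r in range(height) for c in range(width)),
            key=lambda rc: sal[rc[0]][rc[1]],
        )
        if sal[best[0]][best[1]] == float("-inf"):
            break  # every location has already been inhibited
        attended.append(best)
        # Inhibit the winner and its neighborhood.
        for r in range(max(0, best[0] - radius), min(height, best[0] + radius + 1)):
            for c in range(max(0, best[1] - radius), min(width, best[1] + radius + 1)):
                sal[r][c] = float("-inf")
    return attended


toy_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.7],
]
shifts = attend_sequence(toy_map, n_shifts=2)
```

With this toy map, the first selection is the peak at row 1, column 1; once its neighborhood is suppressed, attention moves to the remaining peak at row 2, column 3.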

3,930 citations


"Saliency and Human Fixations: State..." refers background in this paper

  • ...In the field of computer vision, a wide variety of models that aim at mimicking the visual attention cognitive process exists [15] [30]....

    [...]

Proceedings ArticleDOI
20 Jun 2009
TL;DR: This paper introduces a method for salient region detection that outputs full resolution saliency maps with well-defined boundaries of salient objects that outperforms the five algorithms both on the ground-truth evaluation and on the segmentation task by achieving both higher precision and better recall.
Abstract: Detection of visually salient image regions is useful for applications like object segmentation, adaptive compression, and object recognition. In this paper, we introduce a method for salient region detection that outputs full resolution saliency maps with well-defined boundaries of salient objects. These boundaries are preserved by retaining substantially more frequency content from the original image than other existing techniques. Our method exploits features of color and luminance, is simple to implement, and is computationally efficient. We compare our algorithm to five state-of-the-art salient region detection methods with a frequency domain analysis, ground truth, and a salient object segmentation application. Our method outperforms the five algorithms both on the ground-truth evaluation and on the segmentation task by achieving both higher precision and better recall.

3,723 citations


"Saliency and Human Fixations: State..." refers methods in this paper

  • ...SR [9], PFT [8], PQFT [8] and Achanta [1] use a spectral analysis approach to compute their saliency map....

    [...]