Proceedings ArticleDOI

Understanding Low- and High-Level Contributions to Fixation Prediction

TLDR
Comparing different features within the same powerful readout architecture allows us to better understand the relevance of low- versus high-level features in predicting fixation locations, while simultaneously achieving state-of-the-art saliency prediction.
Abstract
Understanding where people look in images is an important problem in computer vision. Despite significant research, it remains unclear to what extent human fixations can be predicted by low-level (contrast) compared to high-level (presence of objects) image features. Here we address this problem by introducing two novel models that use different feature spaces but the same readout architecture. The first model predicts human fixations based on deep neural network features trained on object recognition. This model sets a new state of the art in fixation prediction by achieving top performance in area under the curve metrics on the MIT300 hold-out benchmark (AUC = 88%, sAUC = 77%, NSS = 2.34). The second model uses purely low-level (isotropic contrast) features. This model achieves better performance than all models not using features pretrained on object recognition, making it a strong baseline to assess the utility of high-level features. We then evaluate and visualize which fixations are better explained by low-level compared to high-level image features. Surprisingly we find that a substantial proportion of fixations are better explained by the simple low-level model than the state-of-the-art model. Comparing different features within the same powerful readout architecture allows us to better understand the relevance of low- versus high-level features in predicting fixation locations, while simultaneously achieving state-of-the-art saliency prediction.
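The abstract reports NSS (normalized scanpath saliency) among its benchmark scores. As a point of reference, here is a minimal sketch of how NSS is commonly computed: the mean of the z-scored saliency map values at the human fixation locations. The function name and array conventions are illustrative, not taken from the paper's code.

```python
import numpy as np

def nss(saliency_map: np.ndarray, fixations: np.ndarray) -> float:
    """Normalized Scanpath Saliency: mean z-scored saliency at fixations.

    saliency_map: 2D array of predicted saliency values.
    fixations: (N, 2) integer array of (row, col) fixation coordinates.
    """
    # Z-scoring makes the score invariant to the map's scale and offset.
    z = (saliency_map - saliency_map.mean()) / saliency_map.std()
    rows, cols = fixations[:, 0], fixations[:, 1]
    return float(z[rows, cols].mean())
```

Higher is better: an NSS of 2.34 means that, on average, fixated locations score 2.34 standard deviations above the map's mean.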

Citations
Proceedings ArticleDOI

Saliency Map Extraction in Human Crowd RGB Data

TL;DR: This work proposes a novel convolutional neural network based method for predicting salient regions that attract human visual attention in crowded scenes, and it outperforms state-of-the-art methods on the Eyecrowd human-crowd saliency dataset.
Posted Content

Visual Attention: Deep Rare Features

TL;DR: Contribution-DeepRare2019 (DR) uses the power of DNN feature extraction and the genericity of feature-engineered algorithms to provide accurate visual attention prediction in any situation.
Journal ArticleDOI

A Novel Lightweight Audio-visual Saliency Model for Videos

TL;DR: Wang et al., as discussed by the authors, proposed a lightweight audio-visual saliency (LAVS) model for video sequences, which utilizes audio cues in an efficient deep-learning model for video saliency estimation.
Posted Content

Deep Saliency Prior for Reducing Visual Distraction.

TL;DR: In this article, a saliency model is used to parameterize a differentiable editing operator such that the saliency within the masked region is reduced; the resulting effects are consistent with cognitive research on the human visual system (e.g., since color mismatch is salient, the recoloring operator learns to harmonize objects' colors with their surroundings to reduce their saliency).
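The summary above describes an optimization loop: an editing operator with differentiable parameters is tuned so that a frozen saliency model predicts less saliency inside a user-provided mask. Below is a conceptual sketch of that loop; `saliency_model`, the per-channel color-shift operator, and all hyperparameters are illustrative assumptions, not the paper's actual method or code.

```python
import torch

def reduce_distraction(image, mask, saliency_model, steps=200, lr=0.05):
    """Conceptual sketch: tune a simple recoloring edit so a frozen
    saliency model predicts less saliency inside the masked region.

    image: (3, H, W) tensor; mask: (1, H, W) tensor in {0, 1};
    saliency_model: hypothetical differentiable map from image to saliency.
    """
    # Parameterize the edit as a learnable per-channel color shift
    # applied only inside the masked region.
    shift = torch.zeros(3, 1, 1, requires_grad=True)
    opt = torch.optim.Adam([shift], lr=lr)
    for _ in range(steps):
        edited = image + mask * shift       # differentiable editing operator
        sal = saliency_model(edited)        # predicted saliency map
        loss = (sal * mask).mean()          # penalize saliency in the mask
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (image + mask * shift).detach()
```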
Posted Content

TranSalNet: Visual saliency prediction using transformers.

TL;DR: Zhang et al., as discussed by the authors, proposed a novel saliency model that integrates transformer components into CNNs to capture long-range contextual information; the proposed model achieves promising results in predicting saliency.
References
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A deep convolutional neural network, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art image classification performance as discussed by the authors.
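For readers who want the described topology concretely, here is a minimal PyTorch sketch matching the layer counts in the summary above. Channel counts, kernel sizes, and strides follow the original AlexNet paper, but local response normalization and the original two-GPU split are omitted, so this is an approximation rather than the exact published model.

```python
import torch.nn as nn

# Approximate AlexNet-style topology (expects 227x227 RGB input):
# five conv layers, three max-pooling layers, three fully-connected layers.
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),  # 1000-way output; softmax applied in the loss
)
```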
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings ArticleDOI

Going deeper with convolutions

TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Posted Content

Caffe: Convolutional Architecture for Fast Feature Embedding

TL;DR: Caffe as discussed by the authors is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.