Open Access · Posted Content

Object Referring in Videos with Language and Human Gaze

TLDR
Zhang et al. propose a novel network model for object referring in videos that integrates appearance, motion, gaze, and spatio-temporal context into one network.
Abstract
We investigate the problem of object referring (OR), i.e., localizing a target object in a visual scene given a language description. Humans perceive the world more as continuous video snippets than as static images, and describe objects not only by their appearance, but also by their spatio-temporal context and motion features. Humans also gaze at the object when they issue a referring expression. Existing works for OR mostly focus on static images only, which fall short in providing many such cues. This paper addresses OR in videos with language and human gaze. To that end, we present a new video dataset for OR, with 30,000 objects over 5,000 stereo video sequences annotated with descriptions and gaze. We further propose a novel network model for OR in videos, which integrates appearance, motion, gaze, and spatio-temporal context into one network. Experimental results show that our method effectively utilizes motion cues, human gaze, and spatio-temporal context, and outperforms previous OR methods. For the dataset and code, please refer to this https URL.
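
To make the fusion concrete, here is a minimal PyTorch-style sketch of the kind of late fusion the abstract describes: per-candidate appearance, motion, gaze, and spatio-temporal context features scored against a language embedding. All module names, feature dimensions, and the gating design are illustrative assumptions, not the authors' actual architecture.

```python
# Hypothetical sketch of language-gated multi-cue fusion for object
# referring; not the paper's actual model.
import torch
import torch.nn as nn

class ORFusionScorer(nn.Module):
    def __init__(self, feat_dim=512, lang_dim=300, hidden=256):
        super().__init__()
        # One projection per visual cue (illustrative design choice).
        self.branches = nn.ModuleDict({
            name: nn.Linear(feat_dim, hidden)
            for name in ("appearance", "motion", "gaze", "context")
        })
        self.lang_proj = nn.Linear(lang_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, feats, lang):
        # feats: dict of (num_candidates, feat_dim) tensors, one per cue
        # lang:  (lang_dim,) sentence embedding, e.g. an LSTM final state
        fused = sum(torch.relu(proj(feats[name]))
                    for name, proj in self.branches.items())
        fused = fused * torch.relu(self.lang_proj(lang))  # language gating
        return self.score(fused).squeeze(-1)  # one score per candidate box
```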


Citations
Posted Content

Video Object Segmentation with Language Referring Expressions

TL;DR: In this article, the authors explore an alternative way of identifying a target object, namely by employing language referring expressions, which can help to avoid drift as well as make the system more robust to complex dynamics and appearance variations.
Journal ArticleDOI

When I Look into Your Eyes: A Survey on Computer Vision Contributions for Human Gaze Estimation and Tracking

TL;DR: This survey attempts to fill a gap in the gaze-tracking literature by treating gaze tracking as a broader task that aims at estimating the gaze target from different perspectives, and by introducing a wider point of view that leads to a new taxonomy.
Posted Content

Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

TL;DR: In this paper, the authors address weakly-supervised spatio-temporal grounding of natural sentences in video, using an attentive interactor that exploits the fine-grained relationships between video regions and sentences to characterize their matching behavior.
Proceedings ArticleDOI

Generating Easy-to-Understand Referring Expressions for Target Identifications

TL;DR: This paper addresses the generation of referring expressions that not only refer to objects correctly but also let humans find them quickly, using the time humans require to locate the referred objects, together with their accuracy in doing so.
Journal ArticleDOI

Deep gaze pooling: Inferring and visually decoding search intents from human gaze fixations

TL;DR: This work proposes the first approach to predict categories and attributes of search intents from gaze data and to visually reconstruct plausible targets, and highlights several practical advantages, such as compatibility with existing architectures, no need for gaze training data, and robustness to noise from common gaze sources.
References
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
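
For reference, a minimal NumPy sketch of one LSTM step, highlighting the constant error carousel: the cell state is updated additively under gating, which is what lets gradients flow across long time lags. This uses the now-standard formulation (the forget gate was a later addition to the 1997 design); names and shapes are illustrative.

```python
# One step of a standard LSTM cell; the additive cell-state update is
# the "constant error carousel" the summary refers to.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    # W: (4*H, D+H) stacked gate weights, b: (4*H,) biases
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input/forget/output gates
    c = f * c_prev + i * np.tanh(g)  # additive carousel update
    h = o * np.tanh(c)
    return h, c
```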
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
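
A minimal sketch of the VGG-style recipe behind that result: depth is increased by stacking many small 3x3 convolutions between poolings (the use of 3x3 filters is from the original paper, not this summary). The channel plan below loosely mirrors the early VGG-16 stages and is otherwise illustrative.

```python
# Illustrative VGG-style stack: repeated 3x3 conv + ReLU, with "M"
# marking 2x2 max-pooling; the exact configuration is an assumption.
import torch.nn as nn

def vgg_stack(cfg=(64, 64, "M", 128, 128, "M", 256, 256, 256, "M")):
    layers, in_ch = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(2))
        else:
            layers += [nn.Conv2d(in_ch, v, 3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)
```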
Posted Content

Deep Residual Learning for Image Recognition

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
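
A minimal sketch of the residual idea: each block learns a residual mapping F(x) and adds it to an identity shortcut, out = F(x) + x, which is what makes much deeper networks easier to optimize. Layer sizes and the BatchNorm/ReLU ordering here are illustrative.

```python
# Basic residual block: the body computes F(x), the identity shortcut
# adds x back before the final ReLU.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # identity shortcut
```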
Journal ArticleDOI

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images, run annually from 2010 to the present and attracting participation from more than fifty institutions.
Proceedings ArticleDOI

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
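
A minimal NumPy sketch of the GloVe weighted least-squares objective that this log-bilinear model minimizes: J is the sum over nonzero co-occurrences of f(X_ij)(w_i·w̃_j + b_i + b̃_j − log X_ij)^2. The weighting function and its constants follow the paper; array names are illustrative.

```python
# GloVe objective over a co-occurrence matrix X; clipped weighting f
# downweights rare pairs and caps frequent ones, as in the paper.
import numpy as np

def glove_loss(X, W, W_ctx, b, b_ctx, x_max=100.0, alpha=0.75):
    # X: (V, V) co-occurrence counts; W, W_ctx: (V, d) word/context vectors
    i, j = np.nonzero(X)
    x = X[i, j]
    f = np.minimum((x / x_max) ** alpha, 1.0)  # f(x) = min((x/x_max)^a, 1)
    diff = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j] - np.log(x)
    return np.sum(f * diff ** 2)
```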