Open Access · Posted Content

Object Referring in Videos with Language and Human Gaze

TLDR
Zhang et al. propose a novel network model for object referring in videos that integrates appearance, motion, gaze, and spatio-temporal context into one network.
Abstract
We investigate the problem of object referring (OR), i.e., localizing a target object in a visual scene given a language description. Humans perceive the world more as continuous video snippets than as static images, and describe objects not only by their appearance, but also by their spatio-temporal context and motion features. Humans also gaze at the object when they issue a referring expression. Existing works for OR mostly focus on static images only, which fall short in providing many such cues. This paper addresses OR in videos with language and human gaze. To that end, we present a new video dataset for OR, with 30,000 objects over 5,000 stereo video sequences annotated with descriptions and gaze. We further propose a novel network model for OR in videos, which integrates appearance, motion, gaze, and spatio-temporal context into one network. Experimental results show that our method effectively utilizes motion cues, human gaze, and spatio-temporal context, and outperforms previous OR methods. For the dataset and code, please refer to this https URL.
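
To make the fusion concrete, here is a minimal PyTorch-style sketch of the kind of late fusion the abstract describes: per-candidate appearance, motion, gaze, and spatio-temporal context features scored against a language embedding. All module names, feature dimensions, and the gating design are illustrative assumptions, not the authors' actual architecture.

```python
# Hypothetical sketch of language-gated multi-cue fusion for object
# referring; not the paper's actual model.
import torch
import torch.nn as nn

class ORFusionScorer(nn.Module):
    def __init__(self, feat_dim=512, lang_dim=300, hidden=256):
        super().__init__()
        # One projection per visual cue (illustrative design choice).
        self.branches = nn.ModuleDict({
            name: nn.Linear(feat_dim, hidden)
            for name in ("appearance", "motion", "gaze", "context")
        })
        self.lang_proj = nn.Linear(lang_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, feats, lang):
        # feats: dict of (num_candidates, feat_dim) tensors, one per cue
        # lang:  (lang_dim,) sentence embedding, e.g. an LSTM final state
        fused = sum(torch.relu(proj(feats[name]))
                    for name, proj in self.branches.items())
        fused = fused * torch.relu(self.lang_proj(lang))  # language gating
        return self.score(fused).squeeze(-1)  # one score per candidate box
```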


Citations
Posted Content

Video Object Segmentation with Language Referring Expressions

TL;DR: In this article, the authors explore an alternative way of identifying a target object, namely by employing language referring expressions, which can help to avoid drift as well as make the system more robust to complex dynamics and appearance variations.
Journal ArticleDOI

When I Look into Your Eyes: A Survey on Computer Vision Contributions for Human Gaze Estimation and Tracking

TL;DR: This survey attempts to fill a gap in the gaze-tracking literature by treating gaze tracking as a broader task that aims at estimating the gaze target from different perspectives, and by introducing a wider point of view that leads to a new taxonomy.
Posted Content

Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

TL;DR: In this paper, the authors address weakly-supervised spatio-temporal grounding of natural sentences in video, using an attentive interactor that exploits the fine-grained relationships between video regions and sentences to characterize their matching behavior.
Proceedings ArticleDOI

Generating Easy-to-Understand Referring Expressions for Target Identifications

TL;DR: This paper addresses the generation of referring expressions that not only refer to objects correctly but also let humans find them quickly, using the time humans require to locate the referred objects, together with their accuracy in doing so.
Journal ArticleDOI

Deep gaze pooling: Inferring and visually decoding search intents from human gaze fixations

TL;DR: This work proposes the first approach to predict categories and attributes of search intents from gaze data and to visually reconstruct plausible targets, and highlights several practical advantages, such as compatibility with existing architectures, no need for gaze training data, and robustness to noise from common gaze sources.
References
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
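
For reference, a minimal NumPy sketch of one LSTM step, highlighting the constant error carousel: the cell state is updated additively under gating, which is what lets gradients flow across long time lags. This uses the now-standard formulation (the forget gate was a later addition to the 1997 design); names and shapes are illustrative.

```python
# One step of a standard LSTM cell; the additive cell-state update is
# the "constant error carousel" the summary refers to.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    # W: (4*H, D+H) stacked gate weights, b: (4*H,) biases
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input/forget/output gates
    c = f * c_prev + i * np.tanh(g)  # additive carousel update
    h = o * np.tanh(c)
    return h, c
```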
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
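
A minimal sketch of the VGG-style recipe behind that result: depth is increased by stacking many small 3x3 convolutions between poolings (the use of 3x3 filters is from the original paper, not this summary). The channel plan below loosely mirrors the early VGG-16 stages and is otherwise illustrative.

```python
# Illustrative VGG-style stack: repeated 3x3 conv + ReLU, with "M"
# marking 2x2 max-pooling; the exact configuration is an assumption.
import torch.nn as nn

def vgg_stack(cfg=(64, 64, "M", 128, 128, "M", 256, 256, 256, "M")):
    layers, in_ch = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(2))
        else:
            layers += [nn.Conv2d(in_ch, v, 3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)
```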
Posted Content

Deep Residual Learning for Image Recognition

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
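
A minimal sketch of the residual idea: each block learns a residual mapping F(x) and adds it to an identity shortcut, out = F(x) + x, which is what makes much deeper networks easier to optimize. Layer sizes and the BatchNorm/ReLU ordering here are illustrative.

```python
# Basic residual block: the body computes F(x), the identity shortcut
# adds x back before the final ReLU.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # identity shortcut
```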
Journal ArticleDOI

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images, run annually from 2010 to the present and attracting participation from more than fifty institutions.
Proceedings ArticleDOI

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
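
A minimal NumPy sketch of the GloVe weighted least-squares objective that this log-bilinear model minimizes: J is the sum over nonzero co-occurrences of f(X_ij)(w_i·w̃_j + b_i + b̃_j − log X_ij)^2. The weighting function and its constants follow the paper; array names are illustrative.

```python
# GloVe objective over a co-occurrence matrix X; clipped weighting f
# downweights rare pairs and caps frequent ones, as in the paper.
import numpy as np

def glove_loss(X, W, W_ctx, b, b_ctx, x_max=100.0, alpha=0.75):
    # X: (V, V) co-occurrence counts; W, W_ctx: (V, d) word/context vectors
    i, j = np.nonzero(X)
    x = X[i, j]
    f = np.minimum((x / x_max) ** alpha, 1.0)  # f(x) = min((x/x_max)^a, 1)
    diff = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j] - np.log(x)
    return np.sum(f * diff ** 2)
```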