Open Access · Proceedings Article

TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

TL;DR
In this paper, a dual-LSTM-based approach with both spatial and temporal attention is proposed for video VQA, a task that requires spatio-temporal reasoning over videos to answer questions correctly.
Abstract
Vision and language understanding has emerged as a subject undergoing intense study in Artificial Intelligence. Among the many tasks in this line of research, visual question answering (VQA) has been one of the most successful, where the goal is to learn a model that understands visual content at region-level detail and finds its associations with pairs of questions and answers in natural language. Despite the rapid progress of the past few years, most existing work in VQA has focused primarily on images. In this paper, we focus on extending VQA to the video domain and contribute to the literature in three important ways. First, we propose three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly. Next, we introduce a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with our new tasks. Finally, we propose a dual-LSTM-based approach with both spatial and temporal attention, and show its effectiveness over conventional VQA techniques through empirical evaluations.
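To make the architecture sketched in the abstract concrete, here is a minimal, hypothetical PyTorch rendering of a dual-LSTM video-QA model with question-guided spatial attention over frame regions and temporal attention over frames. The hidden sizes, single-layer attention MLPs, and answer-classification head are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of a dual-LSTM video-QA model with spatial and
# temporal attention, loosely following the architecture the abstract
# describes; shapes and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualLSTMVideoQA(nn.Module):
    def __init__(self, vid_dim=2048, txt_dim=300, hid=512, n_answers=1000):
        super().__init__()
        self.video_lstm = nn.LSTM(vid_dim, hid, batch_first=True)  # temporal encoder
        self.text_lstm = nn.LSTM(txt_dim, hid, batch_first=True)   # question encoder
        self.spatial_att = nn.Linear(vid_dim + hid, 1)              # scores frame regions
        self.temporal_att = nn.Linear(hid + hid, 1)                 # scores frames
        self.classifier = nn.Linear(hid + hid, n_answers)

    def forward(self, frames, question):
        # frames:   (B, T, R, vid_dim)  R region features per frame
        # question: (B, L, txt_dim)     word embeddings
        B, T, R, D = frames.shape
        _, (q, _) = self.text_lstm(question)                 # (1, B, hid)
        q = q.squeeze(0)                                     # (B, hid)

        # Spatial attention: weight regions within each frame by the question.
        q_exp = q[:, None, None, :].expand(B, T, R, q.size(-1))
        s_logits = self.spatial_att(torch.cat([frames, q_exp], -1)).squeeze(-1)
        s_w = F.softmax(s_logits, dim=-1)                    # (B, T, R)
        frame_feats = (s_w.unsqueeze(-1) * frames).sum(2)    # (B, T, vid_dim)

        # Temporal encoding and attention: weight frames by the question.
        h, _ = self.video_lstm(frame_feats)                  # (B, T, hid)
        q_t = q[:, None, :].expand(B, T, q.size(-1))
        t_logits = self.temporal_att(torch.cat([h, q_t], -1)).squeeze(-1)
        t_w = F.softmax(t_logits, dim=-1)                    # (B, T)
        v = (t_w.unsqueeze(-1) * h).sum(1)                   # (B, hid)

        return self.classifier(torch.cat([v, q], -1))        # answer logits
```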



Citations
Proceedings Article

From Recognition to Cognition: Visual Commonsense Reasoning

TL;DR: To move towards cognition-level understanding, a new reasoning engine, Recognition to Cognition Networks (R2C), is presented that models the necessary layered inferences for grounding, contextualization, and reasoning.
Journal Article

Visual Dialog

TL;DR: The authors introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content: given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from the history, and answer the question accurately.
Proceedings Article

Localizing Moments in Video with Natural Language

TL;DR: In this paper, a Moment Context Network (MCN) is proposed to localize natural language queries in videos by integrating local and global video features over time; given a natural language description, the model identifies the specific temporal segment, or moment, of a video that matches it.
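As a rough, hypothetical sketch of the idea this summary describes, the snippet below embeds a candidate moment from its local features, the global video context, and its temporal span, then ranks moments by squared distance to the sentence embedding in a shared space. The dimensions, span encoding, and distance choice are assumptions, not the paper's exact model.

```python
# Hypothetical moment scorer: joint local + global video features with a
# temporal-span encoding, compared to a sentence embedding in a shared space.
import torch
import torch.nn as nn

class MomentScorer(nn.Module):
    def __init__(self, vid_dim, txt_dim, joint_dim=256):
        super().__init__()
        # local moment features + global video feature + (start, end) position
        self.vid_proj = nn.Linear(2 * vid_dim + 2, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, local_feat, global_feat, span, sent_emb):
        # local_feat, global_feat: (B, vid_dim); span: (B, 2); sent_emb: (B, txt_dim)
        v = self.vid_proj(torch.cat([local_feat, global_feat, span], -1))
        s = self.txt_proj(sent_emb)
        return ((v - s) ** 2).sum(-1)  # lower distance = better moment
```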
Proceedings Article

Embodied Question Answering

TL;DR: A new AI task where an agent is spawned at a random location in a 3D environment and asked a question ('What color is the car?'); the agent must intelligently navigate to explore the environment, gather the necessary visual information through first-person (egocentric) vision, and then answer the question.
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; the resulting networks won first place on the ILSVRC 2015 classification task.
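A minimal sketch of the residual idea this summary refers to: the stacked layers learn a residual function F(x) whose output is added back to the identity input, which is what eases the optimization of very deep networks. The channel counts and use of batch normalization below follow common practice and are assumptions here.

```python
# Hypothetical residual block: the layers learn F(x) and the block
# returns F(x) + x via an identity shortcut.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut: y = F(x) + x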
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
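The update rule is compact enough to sketch from the summary: Adam keeps exponential moving averages of the gradient (first moment) and the squared gradient (second moment), corrects their initialization bias, and scales the step accordingly. The hyperparameter defaults below are the commonly cited ones; the function itself is an illustrative rendering, not the paper's reference code.

```python
# Sketch of one Adam update step; t is the 1-based timestep.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)             # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```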
Journal Article

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Proceedings Article

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
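The objective this summary refers to can be sketched as a weighted least-squares fit: word and context vectors plus biases are trained so that their dot product approximates the log co-occurrence count, with a clipped weighting function down-weighting rare pairs. The NumPy rendering below is illustrative; the dense co-occurrence matrix is an assumption for clarity, while x_max=100 and alpha=0.75 are the commonly cited defaults.

```python
# Sketch of the GloVe weighted least-squares objective.
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    # X: dense co-occurrence matrix (V x V); W, W_ctx: (V x d) embeddings
    i, j = np.nonzero(X)
    x = X[i, j]
    f = np.minimum((x / x_max) ** alpha, 1.0)   # weighting function f(X_ij)
    diff = (W[i] * W_ctx[j]).sum(1) + b[i] + b_ctx[j] - np.log(x)
    return (f * diff ** 2).sum()
```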
Proceedings Article

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: The authors conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder-decoder architecture, and propose to extend it by allowing the model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as an explicit hard segment.
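The soft-search mechanism the summary describes is additive attention: the decoder scores every encoder state against its current state with a small MLP and takes a softmax-weighted average, instead of relying on one fixed-length vector. The sketch below is a hypothetical PyTorch rendering with illustrative dimensions.

```python
# Sketch of additive (soft-search) attention over encoder states.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, att_dim=256):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, att_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (B, S, enc_dim); dec_state: (B, dec_dim)
        e = self.v(torch.tanh(self.W_enc(enc_states)
                              + self.W_dec(dec_state).unsqueeze(1)))  # (B, S, 1)
        a = F.softmax(e.squeeze(-1), dim=-1)              # alignment weights
        context = (a.unsqueeze(-1) * enc_states).sum(1)   # (B, enc_dim)
        return context, a
```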