From Recognition to Cognition: Visual Commonsense Reasoning

doi:10.1109/CVPR.2019.00688

Open Access, Proceedings Article

- pp 6720-6731

TLDR

To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.

Abstract:

Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (~45%). To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (~65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.
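The two-step task in the abstract (pick an answer, then pick a rationale that justifies it) can be illustrated with a small scoring sketch. The abstract does not say how the reported accuracies are computed, so the joint rule below (a problem counts only when both the answer and the rationale are right) is an assumption, and the `Prediction` fields and the `accuracy` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Prediction:
    """One model output for one problem (all field names are hypothetical)."""
    answer_idx: int          # index of the answer the model chose
    rationale_idx: int       # index of the rationale the model chose
    gold_answer_idx: int     # index of the correct answer
    gold_rationale_idx: int  # index of the correct rationale


def accuracy(preds: Sequence[Prediction]) -> Dict[str, float]:
    """Return answer, rationale, and joint accuracy over a set of problems.

    Assumption: the joint score credits a problem only when both the
    answer and the rationale match the gold labels.
    """
    n = len(preds)
    if n == 0:
        raise ValueError("need at least one prediction")
    answer_hits = sum(p.answer_idx == p.gold_answer_idx for p in preds)
    rationale_hits = sum(p.rationale_idx == p.gold_rationale_idx for p in preds)
    joint_hits = sum(
        p.answer_idx == p.gold_answer_idx
        and p.rationale_idx == p.gold_rationale_idx
        for p in preds
    )
    return {
        "answer": answer_hits / n,
        "rationale": rationale_hits / n,
        "joint": joint_hits / n,
    }


# Toy data, invented purely for illustration.
preds = [
    Prediction(0, 1, 0, 1),  # answer right, rationale right
    Prediction(2, 3, 2, 0),  # answer right, rationale wrong
    Prediction(1, 1, 3, 1),  # answer wrong, rationale right
    Prediction(3, 2, 3, 2),  # answer right, rationale right
]
scores = accuracy(preds)
print(scores)
```

Under this assumed rule the joint score can never exceed either individual score, which is one reason a model that answers well may still score low when it must also justify its answer.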

Citations

Book Chapter

UNITER: UNiversal Image-TExt Representation Learning

Yen-Chun Chen, +7 more

TL;DR: UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.

Posted Content

VisualBERT: A Simple and Performant Baseline for Vision and Language.

Liunian Harold Li, +4 more

- 09 Aug 2019 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.

Proceedings Article

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Jiasen Lu, +3 more

TL;DR: ViLBERT extends the BERT architecture to a multi-modal two-stream model that processes visual and textual inputs in separate streams, which interact through co-attentional transformer layers.

Posted Content

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

Weijie Su, +6 more

- 22 Aug 2019 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: A new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT), is introduced; it adopts the simple yet powerful Transformer model as its backbone and extends it to take both visual and linguistic embedded features as input.

Journal Article

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training.

Gen Li, +4 more

TL;DR: After pretraining on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer, showing the strength of cross-modal pre-training.

References

Proceedings Article

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

TL;DR: A residual learning framework is proposed to ease the training of networks that are substantially deeper than those used previously; it won 1st place on the ILSVRC 2015 classification task.

Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

Journal Article

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997 -

Neural Computation

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

Proceedings Article

ImageNet: A large-scale hierarchical image database

Jia Deng, +5 more

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

Proceedings Article

GloVe: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

TL;DR: A new global log-bilinear regression model is introduced that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

International Journal of Computer Vision

Related Papers (5)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

VQA: Visual Question Answering

Deep Residual Learning for Image Recognition

Attention is All you Need

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations