Open Access Journal Article (DOI)

Combining Multiple Cues for Visual Madlibs Question Answering

TL;DR
In this paper, the authors present an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset, which employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and attribute prediction.
Abstract
This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset. Instead of generic and commonly used representations trained on the ImageNet classification task, our approach employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and attribute prediction. We also present a method for localizing phrases from candidate answers in order to provide spatial support for feature extraction. We map each of these features, together with candidate answers, to a joint embedding space through normalized canonical correlation analysis (nCCA). Finally, we solve an optimization problem to learn to combine scores from nCCA models trained on multiple cues to select the best answer. Extensive experimental results show a significant improvement over the previous state of the art and confirm that answering questions from a wide range of types benefits from examining a variety of image cues and carefully choosing the spatial support for feature extraction.
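To make the pipeline concrete, here is a minimal Python sketch of the two steps the abstract describes: scoring candidates with a normalized CCA (nCCA) embedding, and combining scores from per-cue models. It assumes precomputed, mean-centered image-cue features and candidate-answer text features; the function names, the power parameter p, and the given cue weights are illustrative assumptions, not the authors' released code.

```python
# A hedged sketch of nCCA scoring and cue combination, not the paper's code.
import numpy as np

def fit_cca(X, Y, reg=1e-4):
    """Linear CCA between mean-centered features X (n x dx) and Y (n x dy).
    Returns projections Wx (dx x k), Wy (dy x k) and correlations s (k,)."""
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # regularized covariances
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Kx = np.linalg.inv(np.linalg.cholesky(Cxx))    # whitening: Kx Cxx Kx.T = I
    Ky = np.linalg.inv(np.linalg.cholesky(Cyy))
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky.T, full_matrices=False)
    return Kx.T @ U, Ky.T @ Vt.T, s

def ncca_scores(x, answers, Wx, Wy, s, p=4):
    """Embed one image-cue vector and the stack of candidate answers into the
    joint space, weight each dimension by its canonical correlation raised to
    a power p (the 'normalization'; p=4 is an assumed hyperparameter from
    prior nCCA work), and rank candidates by cosine similarity."""
    ex = (x @ Wx) * s**p
    ea = (answers @ Wy) * s**p
    ex /= np.linalg.norm(ex)
    ea /= np.linalg.norm(ea, axis=1, keepdims=True)
    return ea @ ex                                  # one score per candidate

def pick_answer(scores_per_cue, cue_weights):
    """Fuse scores from nCCA models trained on different cues; the paper
    learns these weights by optimization, here they are simply given."""
    total = sum(w * sc for w, sc in zip(cue_weights, scores_per_cue))
    return int(np.argmax(total))
```

In this reading, one such model would be trained per cue (scene, activity, attributes, localized-phrase features), with pick_answer fusing their scores at test time.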


Citations
Journal Article (DOI)

Visual question answering: a state-of-the-art review

TL;DR: This review extensively and critically examines the current status of VQA research in terms of step-by-step solution methodologies, datasets, and evaluation metrics, and discusses future research directions for each of these aspects of VQA.
Proceedings Article (DOI)

VQD: Visual Query Detection In Natural Scenes

TL;DR: In this paper, a new visual grounding task called Visual Query Detection (VQD) is proposed, in which the goal is to localize a variable number of objects in an image specified by a natural-language query.
Posted Content

VQD: Visual Query Detection in Natural Scenes

TL;DR: The first algorithms for VQD are proposed and evaluated on both visual referring expression datasets and the authors' new VQDv1 dataset.
Journal Article (DOI)

A survey of methods, datasets and evaluation metrics for visual question answering

TL;DR: This paper discusses some of the core concepts used in VQA systems, presents a comprehensive survey of past efforts to address the problem, and reviews new datasets developed in 2019 and 2020.
Journal Article (DOI)

New ideas and trends in deep multimodal content understanding: a review

TL;DR: This paper surveys deep multimodal content understanding, examining recent multimodal deep models and structures, including auto-encoders, generative adversarial nets, and their variants.
References
Proceedings Article (DOI)

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; the resulting models won first place on the ILSVRC 2015 classification task.
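As background, a minimal PyTorch sketch of the basic residual block that paper introduces; this identity-shortcut case keeps the channel count and spatial size fixed, omitting the downsampling and projection variants.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Basic residual block: y = ReLU(F(x) + x), where F is two 3x3 convs.
    A sketch of the idea, not the full published ResNet specification."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # the skip connection eases optimization
```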
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small (3x3) convolution filters, and shows that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
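The key design choice, replacing large filters with stacks of small 3x3 convolutions, can be sketched as follows; this is an illustration of the pattern, not the exact published VGG-16/19 configuration.

```python
import torch.nn as nn

def conv_stack(in_ch, out_ch, n_convs):
    """A VGG-style stage: n_convs 3x3 convolutions, then 2x2 max-pooling.
    Two stacked 3x3 layers cover a 5x5 receptive field and three cover 7x7,
    with fewer parameters and more nonlinearities than one large filter."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)
```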
Proceedings Article (DOI)

Going deeper with convolutions

TL;DR: This paper introduces Inception, a deep convolutional neural network architecture that achieved the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
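A rough PyTorch sketch of the parallel-branch Inception module the architecture is built from; the branch widths are placeholder parameters rather than the published GoogLeNet settings.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches (1x1, 1x1->3x3, 1x1->5x5, pool->1x1) whose
    outputs are concatenated along the channel dimension; the 1x1 'reduce'
    convolutions keep the cost of the wider filters in check."""
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pp):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3r, c3, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5r, c5, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pp, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], 1)
```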
Journal Article (DOI)

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark for object category classification and detection spanning hundreds of object categories and millions of images; it has been run annually from 2010 to the present, attracting participation from more than fifty institutions.
Book Chapter (DOI)

Microsoft COCO: Common Objects in Context

TL;DR: This paper presents a new dataset that aims to advance the state of the art in object recognition by placing it in the context of the broader question of scene understanding, gathering images of complex everyday scenes containing common objects in their natural context.