Open Access · Posted Content

Generating Question Relevant Captions to Aid Visual Question Answering

TLDR
This work presents a novel approach to improving VQA performance that exploits the connection between captioning and VQA by jointly generating captions targeted to help answer a specific visual question.
Abstract
Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We present a novel approach to improve VQA performance that exploits this connection by jointly generating captions that are targeted to help answer a specific visual question. The model is trained using an existing caption dataset by automatically determining question-relevant captions using an online gradient-based method. Experimental results on the VQA v2 challenge demonstrate that our approach obtains state-of-the-art VQA performance (e.g. 68.4% on the Test-standard set using a single model) by simultaneously generating question-relevant captions.
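The abstract's "online gradient-based method" for picking question-relevant captions could be sketched as follows. This is a minimal illustration under the assumption that relevance is scored by cosine similarity between the caption-loss gradient and the VQA answer-loss gradient with respect to shared features; the exact criterion is defined in the paper, not in this summary, and the function names here are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two gradient vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pick_relevant_caption(answer_grad, caption_grads):
    """Return the index of the caption whose caption-loss gradient
    best aligns with the VQA answer-loss gradient: training on that
    caption should also reduce the answering loss."""
    return max(range(len(caption_grads)),
               key=lambda i: cosine(answer_grad, caption_grads[i]))
```

In this toy setup, the caption whose gradient points in nearly the same direction as the answer gradient is selected.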


Citations
Journal ArticleDOI

Multimodal research in vision and language: A review of current and emerging trends

TL;DR: Presents a detailed overview of the latest trends in research on visual and language modalities, surveying their applications, task formulations, and approaches to problems of semantic perception and content generation.
Proceedings ArticleDOI

All You May Need for VQA are Image Captions

TL;DR: This paper proposes a method that automatically derives VQA examples at scale by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation, and shows that the resulting data is of high quality.
Journal ArticleDOI

Neural Natural Language Generation: A Survey on Multilinguality, Multimodality, Controllability and Learning

TL;DR: This state-of-the-art report investigates the recent developments and applications of NNLG in its full extent from a multidimensional view, covering critical perspectives such as multimodality, multilinguality, controllability and learning strategies.
Journal ArticleDOI

From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models

TL;DR: Img2Prompt is a plug-and-play module that provides prompts bridging the modality and task disconnections between images and language models, so that LLMs can perform zero-shot VQA tasks without end-to-end training.
Posted Content

Multimodal Research in Vision and Language: A Review of Current and Emerging Trends

TL;DR: Deep learning applied to the diverse modalities of real-world data has driven impactful research and development, heightening research interest in the intersection of vision and language, an area with numerous applications and fast-paced growth.
References
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
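The residual learning idea from this reference can be illustrated with a tiny sketch: a block computes y = F(x) + x, so the layers only need to learn the residual F. If the optimal mapping is close to the identity, F can be driven toward zero, which is what eases training of very deep networks. This is a schematic, not the paper's convolutional implementation.

```python
def residual_block(x, f):
    """Apply a residual connection: y = F(x) + x.
    x is a feature vector (list of floats); f computes the residual F."""
    fx = f(x)
    return [xi + fi for xi, fi in zip(x, fx)]
```

With a zero residual function, the block reduces exactly to the identity mapping, which is the degenerate case that makes deeper stacks no harder to optimize than shallower ones.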
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
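The Adam update described in this TL;DR, based on adaptive estimates of the first and second moments of the gradient with bias correction, can be sketched for a single scalar parameter:

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta at step t (1-indexed)."""
    m = b1 * m + (1 - b1) * grad            # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)               # bias-correct the moment estimates
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Because the update is normalized by the second-moment estimate, the effective step size stays on the order of `lr` regardless of the raw gradient scale, which is the property the adaptive moments buy.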
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
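The "constant error carousel" mentioned in this TL;DR refers to the additive update of the LSTM cell state, which lets gradients flow across long time lags. A minimal scalar sketch (not the paper's exact formulation, which predates the now-standard forget gate):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_scalar(x, h, c, W):
    """One step of a scalar LSTM cell. W maps gate name -> (w_x, w_h, b).
    The cell state c is updated additively, so with the forget gate open
    and the input gate closed, c is carried unchanged across steps."""
    i = sigmoid(W['i'][0] * x + W['i'][1] * h + W['i'][2])    # input gate
    f = sigmoid(W['f'][0] * x + W['f'][1] * h + W['f'][2])    # forget gate
    o = sigmoid(W['o'][0] * x + W['o'][1] * h + W['o'][2])    # output gate
    g = math.tanh(W['g'][0] * x + W['g'][1] * h + W['g'][2])  # candidate
    c = f * c + i * g
    h = o * math.tanh(c)
    return h, c
```

Saturating the forget gate open and the input gate shut preserves the cell state exactly, which is how information bridges the thousand-step lags the abstract mentions.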
Proceedings ArticleDOI

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context-window methods, and produces a vector space with meaningful substructure.
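The GloVe objective behind this TL;DR fits word vectors so that w_i · w̃_j + b_i + b̃_j ≈ log X_ij, weighted by a clipped power function of the co-occurrence count X_ij. A minimal sketch of the per-pair loss (x_max and alpha follow the paper's commonly cited defaults):

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): down-weights rare pairs, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the model score and log co-occurrence."""
    score = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij) * (score - math.log(x_ij)) ** 2
```

The loss is zero exactly when the dot product plus biases matches the log co-occurrence count, which is the log-bilinear structure the TL;DR refers to.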
Proceedings ArticleDOI

Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation

TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.