Open Access · Posted Content

Generating Question Relevant Captions to Aid Visual Question Answering

TLDR
This work presents a novel approach to improving VQA performance that exploits the connection between captioning and VQA by jointly generating captions targeted to help answer a specific visual question.
Abstract
Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We present a novel approach to improve VQA performance that exploits this connection by jointly generating captions that are targeted to help answer a specific visual question. The model is trained using an existing caption dataset by automatically determining question-relevant captions using an online gradient-based method. Experimental results on the VQA v2 challenge demonstrate that our approach obtains state-of-the-art VQA performance (e.g. 68.4% on the Test-standard set using a single model) by simultaneously generating question-relevant captions.
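The abstract's "online gradient-based method" for picking question-relevant captions could be sketched as follows. This is a minimal illustration under the assumption that relevance is scored by cosine similarity between the caption-loss gradient and the VQA answer-loss gradient with respect to shared features; the exact criterion is defined in the paper, not in this summary, and the function names here are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two gradient vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pick_relevant_caption(answer_grad, caption_grads):
    """Return the index of the caption whose caption-loss gradient
    best aligns with the VQA answer-loss gradient: training on that
    caption should also reduce the answering loss."""
    return max(range(len(caption_grads)),
               key=lambda i: cosine(answer_grad, caption_grads[i]))
```

In this toy setup, the caption whose gradient points in nearly the same direction as the answer gradient is selected.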


Citations
Journal ArticleDOI

Multimodal research in vision and language: A review of current and emerging trends

TL;DR: Presents a detailed overview of the latest trends in research on visual and language modalities, surveying their applications, task formulations, and approaches to problems of semantic perception and content generation.
Proceedings ArticleDOI

All You May Need for VQA are Image Captions

TL;DR: This paper proposes a method that automatically derives VQA examples at scale by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation, and shows that the resulting data is of high quality.
Journal ArticleDOI

Neural Natural Language Generation: A Survey on Multilinguality, Multimodality, Controllability and Learning

TL;DR: This state-of-the-art report investigates the recent developments and applications of NNLG in its full extent from a multidimensional view, covering critical perspectives such as multimodality, multilinguality, controllability and learning strategies.
Journal ArticleDOI

From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models

TL;DR: Img2Prompt is a plug-and-play module that provides prompts bridging the modality and task disconnections between images and language models, so that LLMs can perform zero-shot VQA tasks without end-to-end training.
Posted Content

Multimodal Research in Vision and Language: A Review of Current and Emerging Trends

TL;DR: Deep learning applied to the diverse modalities of real-world data has driven impactful research and development, heightening research interest in the intersection of vision and language, an area with numerous applications and fast-paced growth.
References
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
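The residual learning idea from this reference can be illustrated with a tiny sketch: a block computes y = F(x) + x, so the layers only need to learn the residual F. If the optimal mapping is close to the identity, F can be driven toward zero, which is what eases training of very deep networks. This is a schematic, not the paper's convolutional implementation.

```python
def residual_block(x, f):
    """Apply a residual connection: y = F(x) + x.
    x is a feature vector (list of floats); f computes the residual F."""
    fx = f(x)
    return [xi + fi for xi, fi in zip(x, fx)]
```

With a zero residual function, the block reduces exactly to the identity mapping, which is the degenerate case that makes deeper stacks no harder to optimize than shallower ones.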
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
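The Adam update described in this TL;DR, based on adaptive estimates of the first and second moments of the gradient with bias correction, can be sketched for a single scalar parameter:

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta at step t (1-indexed)."""
    m = b1 * m + (1 - b1) * grad            # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)               # bias-correct the moment estimates
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Because the update is normalized by the second-moment estimate, the effective step size stays on the order of `lr` regardless of the raw gradient scale, which is the property the adaptive moments buy.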
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
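The "constant error carousel" mentioned in this TL;DR refers to the additive update of the LSTM cell state, which lets gradients flow across long time lags. A minimal scalar sketch (not the paper's exact formulation, which predates the now-standard forget gate):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_scalar(x, h, c, W):
    """One step of a scalar LSTM cell. W maps gate name -> (w_x, w_h, b).
    The cell state c is updated additively, so with the forget gate open
    and the input gate closed, c is carried unchanged across steps."""
    i = sigmoid(W['i'][0] * x + W['i'][1] * h + W['i'][2])    # input gate
    f = sigmoid(W['f'][0] * x + W['f'][1] * h + W['f'][2])    # forget gate
    o = sigmoid(W['o'][0] * x + W['o'][1] * h + W['o'][2])    # output gate
    g = math.tanh(W['g'][0] * x + W['g'][1] * h + W['g'][2])  # candidate
    c = f * c + i * g
    h = o * math.tanh(c)
    return h, c
```

Saturating the forget gate open and the input gate shut preserves the cell state exactly, which is how information bridges the thousand-step lags the abstract mentions.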
Proceedings ArticleDOI

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context-window methods, and produces a vector space with meaningful substructure.
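The GloVe objective behind this TL;DR fits word vectors so that w_i · w̃_j + b_i + b̃_j ≈ log X_ij, weighted by a clipped power function of the co-occurrence count X_ij. A minimal sketch of the per-pair loss (x_max and alpha follow the paper's commonly cited defaults):

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): down-weights rare pairs, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the model score and log co-occurrence."""
    score = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij) * (score - math.log(x_ij)) ** 2
```

The loss is zero exactly when the dot product plus biases matches the log co-occurrence count, which is the log-bilinear structure the TL;DR refers to.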
Proceedings ArticleDOI

Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation

TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.