Open Access · Posted Content
Generating Question Relevant Captions to Aid Visual Question Answering
TL;DR: This work presents a novel approach to improving VQA performance that exploits the connection between captioning and question answering by jointly generating captions targeted at helping answer a specific visual question.

Citations
Journal Article · DOI
Multimodal research in vision and language: A review of current and emerging trends
Shagun Uppal, Sarthak Bhagat, Devamanyu Hazarika, Navonil Majumder, Soujanya Poria, Roger Zimmermann, Amir Zadeh +6 more
TL;DR: A detailed overview of the latest research trends spanning the visual and language modalities, covering their applications, task formulations, and approaches to problems in semantic perception and content generation.
Proceedings Article · DOI
All You May Need for VQA are Image Captions
TL;DR: This paper proposes a method that automatically derives VQA examples at scale by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation, and shows that the resulting data is of high quality.
Journal Article · DOI
Neural Natural Language Generation: A Survey on Multilinguality, Multimodality, Controllability and Learning
Erkut Erdem, Menekşe Kuyu, Semih Yagcioglu, Anette Frank, Letitia Parcalabescu, B. Plank, Andrii Babii, Oleksii Turuta, Aykut Erdem, Iacer Calixto, Elena Lloret, Elena Apostol, Ciprian-Octavian Truica, Branislava Šandrih, Sanda Martincic Ipsic, Gábor Berend, Albert Gatt, Gražina Korvel +17 more
TL;DR: This state-of-the-art report surveys recent developments and applications of NNLG from a multidimensional view, covering critical perspectives such as multimodality, multilinguality, controllability, and learning strategies.
Journal Article · DOI
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
TL;DR: Img2Prompt is a plug-and-play module that provides prompts bridging the modality and task disconnections between vision models and LLMs, so that LLMs can perform zero-shot VQA tasks without end-to-end training.
Posted Content
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
Shagun Uppal, Sarthak Bhagat, Devamanyu Hazarika, Navonil Majumder, Soujanya Poria, Roger Zimmermann, Amir Zadeh +6 more
TL;DR: Deep learning applied to the diverse modalities of real-world data has driven impactful research and development, heightening interest in the intersection of vision and language, with its numerous applications and fast-paced growth.
References
Proceedings Article · DOI
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks substantially deeper than those used previously; it won 1st place in the ILSVRC 2015 classification task.
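The residual learning idea summarized above can be illustrated with a minimal sketch (assumptions: a fully-connected block with NumPy and a hypothetical `residual_block` helper; the actual paper uses convolutional layers with batch normalization):

```python
import numpy as np

def residual_block(x, W1, W2, relu=lambda z: np.maximum(z, 0.0)):
    """A minimal residual block sketch: the layers learn a residual
    F(x), and an identity shortcut adds the input back, so the block
    computes relu(F(x) + x). If the weights are zero, the block
    reduces to the identity (for non-negative x)."""
    out = relu(x @ W1)       # first transformation
    out = out @ W2           # second transformation: the residual F(x)
    return relu(out + x)     # identity shortcut, then activation
```

The shortcut means deeper stacks can at worst learn the identity, which is what eases the training of very deep networks.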
Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba +1 more
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
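The "adaptive estimates of lower-order moments" in the summary above can be sketched as a single Adam update step (a simplified sketch in NumPy; the `adam_step` name and the separate-state calling convention are illustrative, not from the paper):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient
    (first moment m) and squared gradient (second moment v), with
    bias correction for the zero initialization at small t."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Each parameter thus gets its own effective step size, scaled by the ratio of the two moment estimates.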
Journal Article · DOI
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
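The "constant error carousel" mentioned above refers to the additively updated cell state. A minimal single-step LSTM cell sketch (assumptions: stacked weight layout and the `lstm_cell` name are illustrative, and biases are folded into one vector):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h, c, W, U, b):
    """One LSTM step. The cell state c is updated only additively
    (gated forget + gated input), which lets error flow across long
    time lags. W, U, b stack input, forget, output, and candidate
    parameters along the last axis."""
    z = x @ W + h @ U + b                  # all four pre-activations
    i, f, o, g = np.split(z, 4, axis=-1)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c = f * c + i * np.tanh(g)             # additive cell-state update
    h = o * np.tanh(c)                     # gated output
    return h, c
```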
Proceedings Article · DOI
GloVe: Global Vectors for Word Representation
TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
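The global log-bilinear objective can be sketched as a weighted least-squares loss over the co-occurrence matrix (a sketch; the `glove_loss` name and the dense-matrix formulation are illustrative, while `x_max` and `alpha` follow the paper's weighting function):

```python
import numpy as np

def glove_loss(w, w_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """GloVe objective sketch: fit word/context dot products plus
    biases to log co-occurrence counts, down-weighting very frequent
    pairs via f(X_ij) = min(X_ij / x_max, 1) ** alpha."""
    nz = X > 0                                    # only observed pairs
    weight = np.minimum(X / x_max, 1.0) ** alpha  # f(X_ij)
    diff = w @ w_ctx.T + b[:, None] + b_ctx[None, :] \
        - np.log(np.where(nz, X, 1.0))
    return np.sum(weight * nz * diff ** 2)
```

Minimizing this loss over `w`, `w_ctx`, and the biases yields the embedding space; only nonzero entries of `X` contribute.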
Proceedings Article · DOI
Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio +8 more
TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
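The recurrent unit this paper introduced alongside the encoder-decoder (later known as the GRU) can be sketched in one step (assumptions: the `gru_cell` name and six-matrix parameterization are illustrative, with biases omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One step of the gated recurrent unit: an update gate z decides
    how much of the old state to keep, and a reset gate r controls
    how much of it feeds the candidate state."""
    z = sigmoid(x @ Wz + h @ Uz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1 - z) * h + z * h_tilde           # interpolated new state
```

In the encoder-decoder, one such recurrent network compresses the source sequence into a fixed vector and another unrolls it into the target, trained jointly on the conditional probability the summary describes.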