Proceedings ArticleDOI
Cascade Reasoning Network for Text-based Visual Question Answering
Fen Liu, Guanghui Xu, Qi Wu, Qing Du, Wei Jia, Mingkui Tan +5 more
pp. 4060-4069
TL;DR: A novel Cascade Reasoning Network (CRN) is proposed, consisting of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module that explicitly models the connections and interactions between texts and visual concepts.
Abstract:
We study the problem of text-based visual question answering (T-VQA) in this paper. Unlike general visual question answering (VQA), which only builds connections between questions and visual contents, T-VQA requires reading and reasoning over both texts and visual concepts that appear in images. Challenges in T-VQA mainly lie in three aspects: 1) It is difficult to understand the complex logic in questions and extract specific useful information from rich image contents to answer them; 2) The text-related questions are also related to visual concepts, but it is difficult to capture cross-modal relationships between the texts and the visual concepts; 3) If the OCR (optical character recognition) system fails to detect the target text, training becomes very difficult. To address these issues, we propose a novel Cascade Reasoning Network (CRN) that consists of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module. Specifically, the PAM regards the multimodal information fusion operation as a stepwise encoding process and uses the previous attention results to guide the next fusion step. The MRG aims to explicitly model the connections and interactions between texts and visual concepts. To alleviate the dependence on the OCR system, we introduce an auxiliary task to train the model with accurate supervision signals, thereby enhancing the reasoning ability of the model in question answering. Extensive experiments on three popular T-VQA datasets demonstrate the effectiveness of our method compared with SOTA methods. The source code is available at https://github.com/guanghuixu/CRN_tvqa.
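The abstract describes the PAM as a stepwise fusion process in which each step's attention result guides the next. The following is a minimal NumPy sketch of that idea, not the authors' implementation (their actual code is at the GitHub link above); the function name `progressive_attention` and the additive guide update are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def progressive_attention(question, features_per_step):
    """Stepwise multimodal fusion (hypothetical PAM-style sketch).

    question:          (d,) question embedding, used as the initial guide.
    features_per_step: list of (N_i, d) arrays, e.g. [object features,
                       OCR token features], attended one step at a time.
    The attended summary from step t is fused into the guide that
    queries step t+1, so earlier attention results steer later fusion.
    """
    guide = question
    for feats in features_per_step:
        scores = feats @ guide          # (N_i,) relevance of each feature
        attn = softmax(scores)          # attention weights over features
        summary = attn @ feats          # (d,) attended summary of this step
        guide = guide + summary         # previous result guides the next step
    return guide
```

In the real model the fusion would be learned (projection layers, multi-head attention) rather than a plain dot product and addition, but the cascade structure, each step conditioned on the previous step's attended output, is the point being illustrated.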
Citations
Journal ArticleDOI
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang +8 more
TL;DR: This paper designs and trains GIT to unify vision-language tasks such as image/video captioning and question answering, and presents a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.
Posted Content
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo +8 more
TL;DR: This paper proposes Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks, and builds a large-scale scene text-related image-text dataset named OCR-CC, based on the Conceptual Captions dataset, which contains 1.4 million images with scene text.
Proceedings ArticleDOI
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo +8 more
TL;DR: The authors proposed Text-Aware Pre-training (TAP) for text-VQA and text-Caption tasks, which explicitly incorporates scene text (generated from OCR engines) during pretraining.
Journal ArticleDOI
A survey of methods, datasets and evaluation metrics for visual question answering
TL;DR: This paper discusses core concepts used in VQA systems, presents a comprehensive survey of past efforts to address the problem, and covers new datasets developed in 2019 and 2020.
Journal ArticleDOI
Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering
TL;DR: 3D geometric information is introduced into a human-like spatial reasoning process to capture the contextual knowledge of key objects step by step, achieving state-of-the-art performance on the TextVQA and ST-VQA datasets.
References
Proceedings ArticleDOI
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin +7 more
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Journal ArticleDOI
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals, and further merges the RPN and Fast R-CNN into a single network by sharing their convolutional features.
Proceedings ArticleDOI
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: BERT as mentioned in this paper pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Posted Content
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
TL;DR: Faster R-CNN as discussed by the authors proposes a Region Proposal Network (RPN) to generate high-quality region proposals, which are used by Fast R-CNN for detection.