Proceedings ArticleDOI

Cascade Reasoning Network for Text-based Visual Question Answering

TLDR
A novel Cascade Reasoning Network (CRN) is proposed, consisting of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module that explicitly models the connections and interactions between texts and visual concepts.
Abstract
We study the problem of text-based visual question answering (T-VQA) in this paper. Unlike general visual question answering (VQA), which only builds connections between questions and visual contents, T-VQA requires reading and reasoning over both the texts and the visual concepts that appear in images. The challenges in T-VQA mainly lie in three aspects: 1) it is difficult to understand the complex logic in questions and to extract from rich image contents the specific information needed to answer them; 2) text-related questions often involve visual concepts as well, but the cross-modal relationships between texts and visual concepts are hard to capture; 3) if the OCR (optical character recognition) system fails to detect the target text, training becomes very difficult. To address these issues, we propose a novel Cascade Reasoning Network (CRN) that consists of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module. Specifically, the PAM regards multimodal information fusion as a stepwise encoding process and uses the previous attention results to guide the next fusion step. The MRG explicitly models the connections and interactions between texts and visual concepts. To alleviate the dependence on the OCR system, we introduce an auxiliary task that trains the model with accurate supervision signals, thereby enhancing its reasoning ability in question answering. Extensive experiments on three popular T-VQA datasets demonstrate the effectiveness of our method compared with state-of-the-art methods. The source code is available at https://github.com/guanghuixu/CRN_tvqa.
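To make the stepwise fusion idea concrete, here is a minimal sketch of a progressive attention step in PyTorch. It is not the authors' implementation (see the linked repository for that); the module layout, the concatenation-based scoring, and the additive query update are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProgressiveAttentionSketch(nn.Module):
    """Minimal sketch of stepwise multimodal fusion: each step attends over
    one modality (e.g. visual objects, then OCR tokens), and the attended
    summary from the previous step guides the next attention step.
    Dimensions and fusion details are illustrative assumptions, not the
    authors' exact design."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # scores a (query, feature) pair

    def attend(self, query, feats):
        # query: (B, D); feats: (B, N, D) -> weighted sum over the N features
        q = query.unsqueeze(1).expand(-1, feats.size(1), -1)
        logits = self.score(torch.cat([q, feats], dim=-1)).squeeze(-1)
        weights = torch.softmax(logits, dim=-1)        # (B, N)
        return (weights.unsqueeze(-1) * feats).sum(1)  # (B, D)

    def forward(self, question, visual_feats, ocr_feats):
        # Step 1: the question attends over visual features.
        v = self.attend(question, visual_feats)
        # Step 2: the previous attention result guides attention over OCR
        # tokens (here by simply adding it to the query).
        t = self.attend(question + v, ocr_feats)
        return v, t
```

The point the sketch illustrates is that the attended visual summary from step 1 becomes part of the query for step 2, so later fusion steps are guided by earlier attention results.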


Citations
Journal ArticleDOI

GIT: A Generative Image-to-text Transformer for Vision and Language

TL;DR: This paper designs and trains GIT, a Generative Image-to-text Transformer, to unify vision-language tasks such as image/video captioning and question answering, and presents a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.
Proceedings ArticleDOI

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

TL;DR: The authors propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks, which explicitly incorporates scene text (generated by OCR engines) during pre-training, and build OCR-CC, a large-scale scene-text image-text dataset of 1.4 million images based on the Conceptual Captions dataset.
Journal ArticleDOI

A survey of methods, datasets and evaluation metrics for visual question answering

TL;DR: This paper discusses some of the core concepts used in VQA systems, presents a comprehensive survey of past efforts to address the problem, and covers new datasets developed in 2019 and 2020.
Journal ArticleDOI

Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering

TL;DR: 3D geometric information is introduced into a human-like spatial reasoning process to capture the contextual knowledge of key objects step by step, achieving state-of-the-art performance on the TextVQA and ST-VQA datasets.
References
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; it won 1st place in the ILSVRC 2015 classification task.
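The residual idea (an identity shortcut added to a stacked transformation, y = x + F(x)) can be sketched in a few lines; the layer configuration below is a generic basic block for illustration, not the paper's exact architecture.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = x + F(x): the identity shortcut lets gradients bypass the stacked
    layers, easing optimization of very deep networks."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))  # identity shortcut
```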
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
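At the core of that architecture is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V; a minimal sketch of that formula:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(QK^T / sqrt(d_k)) V, as defined in the Transformer paper."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., L_q, L_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v           # (..., L_q, d_v)
```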
Journal ArticleDOI

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.
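A minimal sketch of the RPN head may help: a small network slides over the shared convolutional feature map and emits, per anchor at each location, an objectness score and four box-regression deltas. The channel and anchor counts follow the paper's defaults, but the code itself is an illustrative reconstruction, not the reference implementation.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Slides a small network over the shared feature map and predicts, per
    spatial location and per anchor, an objectness score and 4 box deltas."""

    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(in_channels, num_anchors, 1)      # objectness
        self.reg = nn.Conv2d(in_channels, num_anchors * 4, 1)  # box deltas

    def forward(self, feats):
        h = self.relu(self.conv(feats))
        return self.cls(h), self.reg(h)
```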
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
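The fine-tuning recipe (a pretrained bidirectional encoder plus a single task-specific output layer) can be sketched as follows; this assumes the HuggingFace transformers package and a hypothetical classification task, neither of which comes from the paper itself.

```python
import torch.nn as nn
from transformers import BertModel  # assumes the HuggingFace transformers package

class BertClassifier(nn.Module):
    """Fine-tuning sketch: a pretrained bidirectional encoder plus one
    task-specific output layer, as the paper describes."""

    def __init__(self, num_labels):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.pooler_output)  # the one added output layer
```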