Proceedings ArticleDOI

Cascade Reasoning Network for Text-based Visual Question Answering

TLDR
A novel Cascade Reasoning Network (CRN) is proposed, consisting of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module that explicitly models the connections and interactions between texts and visual concepts.
Abstract
We study the problem of text-based visual question answering (T-VQA) in this paper. Unlike general visual question answering (VQA), which only builds connections between questions and visual contents, T-VQA requires reading and reasoning over both the texts and the visual concepts that appear in images. The challenges in T-VQA mainly lie in three aspects: 1) it is difficult to understand the complex logic in questions and to extract from rich image contents the specific information needed to answer them; 2) text-related questions often involve visual concepts as well, but the cross-modal relationships between texts and visual concepts are hard to capture; 3) if the OCR (optical character recognition) system fails to detect the target text, training becomes very difficult. To address these issues, we propose a novel Cascade Reasoning Network (CRN) that consists of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module. Specifically, the PAM regards multimodal information fusion as a stepwise encoding process and uses the previous attention results to guide the next fusion step. The MRG explicitly models the connections and interactions between texts and visual concepts. To alleviate the dependence on the OCR system, we introduce an auxiliary task that trains the model with accurate supervision signals, thereby enhancing its reasoning ability in question answering. Extensive experiments on three popular T-VQA datasets demonstrate the effectiveness of our method compared with state-of-the-art methods. The source code is available at https://github.com/guanghuixu/CRN_tvqa.
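To make the stepwise fusion idea concrete, here is a minimal sketch of a progressive attention step in PyTorch. It is not the authors' implementation (see the linked repository for that); the module layout, the concatenation-based scoring, and the additive query update are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProgressiveAttentionSketch(nn.Module):
    """Minimal sketch of stepwise multimodal fusion: each step attends over
    one modality (e.g. visual objects, then OCR tokens), and the attended
    summary from the previous step guides the next attention step.
    Dimensions and fusion details are illustrative assumptions, not the
    authors' exact design."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # scores a (query, feature) pair

    def attend(self, query, feats):
        # query: (B, D); feats: (B, N, D) -> weighted sum over the N features
        q = query.unsqueeze(1).expand(-1, feats.size(1), -1)
        logits = self.score(torch.cat([q, feats], dim=-1)).squeeze(-1)
        weights = torch.softmax(logits, dim=-1)        # (B, N)
        return (weights.unsqueeze(-1) * feats).sum(1)  # (B, D)

    def forward(self, question, visual_feats, ocr_feats):
        # Step 1: the question attends over visual features.
        v = self.attend(question, visual_feats)
        # Step 2: the previous attention result guides attention over OCR
        # tokens (here by simply adding it to the query).
        t = self.attend(question + v, ocr_feats)
        return v, t
```

The point the sketch illustrates is that the attended visual summary from step 1 becomes part of the query for step 2, so later fusion steps are guided by earlier attention results.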


Citations
Journal ArticleDOI

GIT: A Generative Image-to-text Transformer for Vision and Language

TL;DR: This paper designs and trains GIT, a Generative Image-to-text Transformer, to unify vision-language tasks such as image/video captioning and question answering, and presents a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.
Proceedings ArticleDOI

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

TL;DR: The authors propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks, which explicitly incorporates scene text (generated by OCR engines) during pre-training, and build OCR-CC, a large-scale scene-text image-text dataset of 1.4 million images based on the Conceptual Captions dataset.
Journal ArticleDOI

A survey of methods, datasets and evaluation metrics for visual question answering

TL;DR: This paper discusses some of the core concepts used in VQA systems, presents a comprehensive survey of past efforts to address the problem, and covers new datasets developed in 2019 and 2020.
Journal ArticleDOI

Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering

TL;DR: 3D geometric information is introduced into a human-like spatial reasoning process to capture the contextual knowledge of key objects step by step, achieving state-of-the-art performance on the TextVQA and ST-VQA datasets.
References
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; it won 1st place in the ILSVRC 2015 classification task.
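The residual idea (an identity shortcut added to a stacked transformation, y = x + F(x)) can be sketched in a few lines; the layer configuration below is a generic basic block for illustration, not the paper's exact architecture.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = x + F(x): the identity shortcut lets gradients bypass the stacked
    layers, easing optimization of very deep networks."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))  # identity shortcut
```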
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
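At the core of that architecture is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V; a minimal sketch of that formula:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(QK^T / sqrt(d_k)) V, as defined in the Transformer paper."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., L_q, L_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v           # (..., L_q, d_v)
```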
Journal ArticleDOI

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.
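A minimal sketch of the RPN head may help: a small network slides over the shared convolutional feature map and emits, per anchor at each location, an objectness score and four box-regression deltas. The channel and anchor counts follow the paper's defaults, but the code itself is an illustrative reconstruction, not the reference implementation.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Slides a small network over the shared feature map and predicts, per
    spatial location and per anchor, an objectness score and 4 box deltas."""

    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(in_channels, num_anchors, 1)      # objectness
        self.reg = nn.Conv2d(in_channels, num_anchors * 4, 1)  # box deltas

    def forward(self, feats):
        h = self.relu(self.conv(feats))
        return self.cls(h), self.reg(h)
```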
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
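The fine-tuning recipe (a pretrained bidirectional encoder plus a single task-specific output layer) can be sketched as follows; this assumes the HuggingFace transformers package and a hypothetical classification task, neither of which comes from the paper itself.

```python
import torch.nn as nn
from transformers import BertModel  # assumes the HuggingFace transformers package

class BertClassifier(nn.Module):
    """Fine-tuning sketch: a pretrained bidirectional encoder plus one
    task-specific output layer, as the paper describes."""

    def __init__(self, num_labels):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.pooler_output)  # the one added output layer
```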