Open Access Proceedings ArticleDOI

Scene Text Visual Question Answering

TLDR
The authors propose the ST-VQA dataset, which defines a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer.
Abstract
Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting the high-level semantic information present in images as textual cues in the Visual Question Answering process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks that accounts both for reasoning errors and for shortcomings of the text recognition module. In addition, we put forward a series of baseline methods, which provide further insight into the newly released dataset and set the scene for further research.
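The metric itself is not reproduced on this page; the sketch below assumes it follows the Average Normalized Levenshtein Similarity (ANLS) formulation associated with ST-VQA, in which credit degrades smoothly with string-level recognition errors and drops to zero below a similarity threshold. Function names and the default threshold are illustrative.

```python
# Minimal sketch of an ANLS-style metric (assumed formulation; the paper's
# exact definition may differ). Small OCR errors cost a little; answers below
# the threshold score zero, penalizing reasoning errors.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(prediction: str, gt_answers: list[str], tau: float = 0.5) -> float:
    """Best normalized similarity of a prediction against any ground truth."""
    best = 0.0
    for gt in gt_answers:
        p, g = prediction.strip().lower(), gt.strip().lower()
        sim = 1.0 - levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, sim)
    return best if best >= tau else 0.0
```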



Citations
Book ChapterDOI

RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition

TL;DR: The proposed method, dubbed RobustScanner, decodes individual characters with a dynamic ratio between context and positional clues, relying more on positional clues when decoding sequences with scarce context, which makes it robust and practical.
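The dynamic weighting described above is not detailed on this page; the following is a minimal sketch of one plausible gated-fusion step, assuming an element-wise sigmoid gate computed from both cues (all names and shapes are illustrative, not RobustScanner's exact module):

```python
import torch

def dynamic_fusion(context_feat: torch.Tensor,
                   position_feat: torch.Tensor,
                   w_gate: torch.Tensor,
                   b_gate: torch.Tensor) -> torch.Tensor:
    """Mix context and positional features with an input-dependent gate.

    context_feat, position_feat: (..., d); w_gate: (2*d, d); b_gate: (d,).
    """
    both = torch.cat([context_feat, position_feat], dim=-1)
    gate = torch.sigmoid(both @ w_gate + b_gate)  # per-dimension reliance ratio
    return gate * context_feat + (1.0 - gate) * position_feat
```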
Proceedings ArticleDOI

Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA

TL;DR: This work proposes a multimodal transformer architecture accompanied by a rich representation for text in images; it fuses the different modalities homogeneously by embedding them into a common semantic space, where self-attention models inter- and intra-modality context, and decodes answers iteratively with a dynamic pointer network, forming the answer through multi-step prediction instead of one-step classification.
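As a rough illustration of the copy mechanism such a dynamic pointer network enables, the sketch below concatenates fixed-vocabulary scores with dot-product scores over detected OCR tokens, so each decoding step can either generate a word or copy scene text (shapes and names are assumptions, not the paper's exact formulation):

```python
import torch

def pointer_step(dec_state: torch.Tensor,      # (d,) current decoder state
                 vocab_logits: torch.Tensor,   # (V,) scores over fixed vocab
                 ocr_features: torch.Tensor    # (num_ocr, d) OCR token features
                 ) -> torch.Tensor:
    """One decoding step: joint scores over vocabulary words and OCR tokens."""
    copy_logits = ocr_features @ dec_state          # similarity acts as copy score
    return torch.cat([vocab_logits, copy_logits])   # argmax picks a word or a copy
```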
Posted Content

DocVQA: A Dataset for VQA on Document Images

TL;DR: Although existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy).
Posted Content

Text Recognition in the Wild: A Survey

TL;DR: This literature review attempts to present the entire picture of the field of scene text recognition, providing a comprehensive reference for newcomers to the field and helping to inspire future research.
References
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; the resulting model won 1st place in the ILSVRC 2015 classification task.
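The core idea is to learn a residual mapping F(x) and add it back to the input, y = F(x) + x; a minimal sketch of such a block (illustrative, not the exact ILSVRC-winning architecture) might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Minimal residual block: output = relu(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # identity shortcut eases optimization
```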
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
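The update rule summarized above can be written compactly; the sketch below follows the standard published algorithm with its default hyperparameters (variable names are illustrative):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: adaptive estimates of first and second moments."""
    m = beta1 * m + (1 - beta1) * grad         # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2    # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```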
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
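The key design choice is stacking very small (3x3) convolution filters to build depth; a sketch of one such block, assuming the standard conv-ReLU-pool pattern (two stacked 3x3 convolutions cover a 5x5 receptive field with fewer parameters):

```python
import torch.nn as nn

def vgg_block(in_ch: int, out_ch: int, n_convs: int) -> nn.Sequential:
    """Stack of 3x3 convolutions followed by 2x2 max-pooling, VGG-style."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))  # halve spatial resolution between blocks
    return nn.Sequential(*layers)
```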
Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced: a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity, and much more accurate, than current image datasets.