Scene Text Visual Question Answering

Open AccessPosted Content

Scene Text Visual Question Answering

Ali Furkan Biten, +7 more

- 31 May 2019 -

arXiv: Computer Vision and Pattern Recog...

Chats0

TLDR

A new dataset, ST-VQA, is presented that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process and proposes a new evaluation metric for these tasks to account both for reasoning errors as well as shortcomings of the text recognition module.

Abstract:

Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account both for reasoning errors as well as shortcomings of the text recognition module. In addition we put forward a series of baseline methods, which provide further insight to the newly released dataset, and set the scene for further research.

Citations

PDF

Open Access

More filters

Book ChapterDOI

RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition

Xiaoyu Yue, +4 more

TL;DR: Theoretically, the proposed method, dubbed \emph{RobustScanner}, decodes individual characters with dynamic ratio between context and positional clues, and utilizes more positional ones when the decoding sequences with scarce context, and thus is robust and practical.

...read moreread less

Journal ArticleDOI

GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang, +8 more

TL;DR: This paper designs and train a GIT to unify vision-language tasks such as image/video captioning and question answering and presents a new scheme of generation-based image classiﬁcation and scene text recognition, achieving decent performance on standard benchmarks.

...read moreread less

Posted Content

DocVQA: A Dataset for VQA on Document Images

Minesh Mathew, +2 more

- 01 Jul 2020 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: Although the existing models perform reasonably well on certain types of questions, there is large performance gap compared to human performance (94.36% accuracy).

...read moreread less

Posted Content

Text Recognition in the Wild: A Survey

Xiaoxue Chen, +4 more

- 07 May 2020 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This literature review attempts to present the entire picture of the field of scene text recognition, which provides a comprehensive reference for people entering this field, and could be helpful to inspire future research.

...read moreread less

Collapse

References

PDF

Open Access

More filters

Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.

...read moreread less

Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

...read moreread less

Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

Jia Deng, +5 more

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

...read moreread less

Proceedings ArticleDOI

Glove: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

TL;DR: A new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods and produces a vector space with meaningful substructure.

...read moreread less

Book ChapterDOI

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.

...read moreread less