Open Access, Posted Content

Scene Text Visual Question Answering

TLDR
This work presents ST-VQA, a new dataset that highlights the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process, and proposes a new evaluation metric for these tasks that accounts for both reasoning errors and shortcomings of the text recognition module.
Abstract
Current visual question answering (VQA) datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks that accounts for both reasoning errors and shortcomings of the text recognition module. In addition, we put forward a series of baseline methods, which provide further insight into the newly released dataset and set the scene for further research.
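
The abstract does not define the proposed metric. As a rough illustration only, the sketch below assumes a thresholded normalized Levenshtein similarity, the form commonly reported for ST-VQA as ANLS: small character-level mistakes (typical of text recognition) lose partial credit, while answers too far from every reference (typical of reasoning errors) score zero. The function names, the lowercasing step, and the 0.5 threshold are assumptions made for this example, not details taken from the text above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def answer_score(prediction: str, ground_truths: list[str],
                 threshold: float = 0.5) -> float:
    """Best similarity of one prediction against its accepted answers.

    Assumption: similarity is 1 - normalized edit distance, and is
    zeroed once the normalized distance reaches `threshold`.
    """
    pred = prediction.strip().lower()
    best = 0.0
    for gt in ground_truths:
        ref = gt.strip().lower()
        longest = max(len(pred), len(ref))
        distance = levenshtein(pred, ref) / longest if longest else 0.0
        score = 1.0 - distance if distance < threshold else 0.0
        best = max(best, score)
    return best


def mean_score(predictions: list[str], ground_truths: list[list[str]],
               threshold: float = 0.5) -> float:
    """Average per-question score over the whole evaluation set."""
    scores = [answer_score(p, g, threshold)
              for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, a prediction such as "coca c0la" against the reference "Coca Cola" keeps most of its credit (one wrong character out of nine), whereas "pepsi" falls past the threshold and contributes nothing.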

