Open AccessPosted Content
Scene Text Visual Question Answering
Ali Furkan Biten,Rubèn Tito,Andres Mafla,Lluis Gomez,Marçal Rusiñol,Ernest Valveny,C. V. Jawahar,Dimosthenis Karatzas +7 more
Reads0
Chats0
TLDR
A new dataset, ST-VQA, is presented that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process and proposes a new evaluation metric for these tasks to account both for reasoning errors as well as shortcomings of the text recognition module.Abstract:
Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account both for reasoning errors as well as shortcomings of the text recognition module. In addition we put forward a series of baseline methods, which provide further insight to the newly released dataset, and set the scene for further research.read more
Citations
More filters
Proceedings ArticleDOI
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Xi Chen,Xiao Jing Wang,Soravit Changpinyo,AJ Piergiovanni,Piotr Padlewski,Daniel M Salz,Sebastian Goodman,Adam Grycner,Basil Mustafa,Lucas Beyer,Alexander Kolesnikov,Joan Puigcerver,Nan Ding,Keran Rong,Hassan Akbari,Gaurav Mishra,Linting Xue,Ashish V. Thapliyal,James Bradbury,Weicheng Kuo,Mojtaba Seyedhosseini,Chao Jia,Burcu Karagol Ayan,Carlos Riquelme,Andreas Steiner,Anelia Angelova,Xiaohua Zhai,Neil Houlsby,Radu Soricut +28 more
TL;DR: PaLI achieves state-of-the-art in multiple vision and language tasks, while retaining a simple, modular, and scalable design.
Book ChapterDOI
RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition
TL;DR: Theoretically, the proposed method, dubbed \emph{RobustScanner}, decodes individual characters with dynamic ratio between context and positional clues, and utilizes more positional ones when the decoding sequences with scarce context, and thus is robust and practical.
Journal ArticleDOI
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang,Zhengyuan Yang,Xiaowei Hu,Linjie Li,Kevin Lin,Zhe Yuan Gan,Zicheng Liu,Ce Liu,Lijuan Wang +8 more
TL;DR: This paper designs and train a GIT to unify vision-language tasks such as image/video captioning and question answering and presents a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.
Posted Content
DocVQA: A Dataset for VQA on Document Images
TL;DR: Although the existing models perform reasonably well on certain types of questions, there is large performance gap compared to human performance (94.36% accuracy).
Posted Content
Text Recognition in the Wild: A Survey
TL;DR: This literature review attempts to present the entire picture of the field of scene text recognition, which provides a comprehensive reference for people entering this field, and could be helpful to inspire future research.
References
More filters
Proceedings ArticleDOI
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan,Andrew Zisserman +1 more
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings ArticleDOI
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings ArticleDOI
Glove: Global Vectors for Word Representation
TL;DR: A new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods and produces a vector space with meaningful substructure.
Book ChapterDOI
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin,Michael Maire,Serge Belongie,James Hays,Pietro Perona,Deva Ramanan,Piotr Dollár,C. Lawrence Zitnick +7 more
TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.