TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text
Amanpreet Singh, Guan Pang, Mandy Toh, Jing Huang, Wojciech Galuba, Tal Hassner
- pp 8802-8812
TLDR
TextOCR is an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images from the TextVQA dataset. It is used to train PixelM4C, a model that can do scene text based reasoning on an image in an end-to-end fashion.
Abstract:
A crucial component of the scene text based reasoning required for the TextVQA and TextCaps datasets involves detecting and recognizing text present in the images using an optical character recognition (OCR) system. Current systems are crippled by the unavailability of ground truth text annotations for these datasets, as well as by the lack of scene text detection and recognition datasets on real images. This hinders progress in the field of OCR and prevents evaluation of scene text based reasoning in isolation from OCR systems. In this work, we propose TextOCR, an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images from the TextVQA dataset. We show that current state-of-the-art text-recognition (OCR) models fail to perform well on TextOCR and that training on TextOCR helps achieve state-of-the-art performance on multiple other OCR datasets as well. We use a TextOCR-trained OCR model to create the PixelM4C model, which can do scene text based reasoning on an image in an end-to-end fashion, allowing us to revisit several design choices to achieve new state-of-the-art performance on the TextVQA dataset.