TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text

doi:10.1109/CVPR46437.2021.00869

Open Access Proceedings Article

- pp 8802-8812

TLDR

TextOCR is an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images from the TextVQA dataset. An OCR model trained on TextOCR is used to build PixelM4C, which can do scene-text-based reasoning on an image in an end-to-end fashion.

Abstract:

A crucial component of the scene-text-based reasoning required for the TextVQA and TextCaps datasets involves detecting and recognizing text present in the images using an optical character recognition (OCR) system. Current systems are crippled by the unavailability of ground-truth text annotations for these datasets, as well as by the lack of scene text detection and recognition datasets on real images, which hinders progress in the field of OCR and prevents the evaluation of scene-text-based reasoning in isolation from OCR systems. In this work, we propose TextOCR, an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images from the TextVQA dataset. We show that current state-of-the-art text-recognition (OCR) models fail to perform well on TextOCR and that training on TextOCR helps achieve state-of-the-art performance on multiple other OCR datasets as well. We use a TextOCR-trained OCR model to create the PixelM4C model, which can perform scene-text-based reasoning on an image in an end-to-end fashion, allowing us to revisit several design choices and achieve new state-of-the-art performance on the TextVQA dataset.

Citations

Journal Article

A Metaverse: Taxonomy, Components, Applications, and Open Challenges

Sang-Min Park, +1 more

- 01 Jan 2022 -

IEEE Access

TL;DR: In this article, the authors divide the concepts and essential techniques necessary for realizing the Metaverse into three components (i.e., hardware, software, and contents), rather than taking a marketing or hardware approach, to conduct a comprehensive analysis. They also describe three approaches and relate the essential methods to representative Metaverse examples (Ready Player One, Roblox, and Facebook research) in the domains of films, games, and studies.

Journal Article

GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang, +8 more

TL;DR: This paper designs and trains GIT, a generative image-to-text transformer, to unify vision-language tasks such as image/video captioning and question answering, and presents a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.

Posted Content

Structured Multimodal Attentions for TextVQA

Chenyu Gao, +6 more

- 01 Jun 2020 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: An end-to-end structured multimodal attention (SMA) neural network is proposed to mainly solve the first two issues above.

Proceedings Article

A Multiplexed Network for End-to-End, Multilingual OCR

Jing Huang, +7 more

TL;DR: This paper proposed an end-to-end training pipeline that includes both detection and recognition, and achieved state-of-the-art results on both text detection and script identification benchmarks.

References

Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.

Proceedings Article

ImageNet: A large-scale hierarchical image database

Jia Deng, +5 more

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

Proceedings Article

Glove: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

TL;DR: A new global log-bilinear regression model combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

Proceedings Article

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Related Papers (5)

PhotoOCR: Reading Text in Uncontrolled Conditions

Dictionary-guided Scene Text Recognition

WordSup: Exploiting Word Annotations for Character Based Text Detection

Dynamic Lexicon Generation for Natural Scene Images

Deep Residual Learning for Image Recognition