Open Access · Proceedings Article

TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text

TLDR
TextOCR is an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images from the TextVQA dataset. An OCR model trained on it is used to build PixelM4C, which performs scene text based reasoning on an image in an end-to-end fashion.
Abstract
A crucial component of the scene text based reasoning required for the TextVQA and TextCaps datasets involves detecting and recognizing text present in the images using an optical character recognition (OCR) system. Current systems are crippled by the unavailability of ground truth text annotations for these datasets, as well as by the lack of scene text detection and recognition datasets on real images. Both gaps hinder progress in the field of OCR and prevent the evaluation of scene text based reasoning in isolation from OCR systems. In this work, we propose TextOCR, an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images from the TextVQA dataset. We show that current state-of-the-art text-recognition (OCR) models fail to perform well on TextOCR and that training on TextOCR helps achieve state-of-the-art performance on multiple other OCR datasets as well. We use a TextOCR-trained OCR model to create the PixelM4C model, which can do scene text based reasoning on an image in an end-to-end fashion, allowing us to revisit several design choices to achieve new state-of-the-art performance on the TextVQA dataset.


Citations
Journal Article

A Metaverse: Taxonomy, Components, Applications, and Open Challenges

01 Jan 2022
TL;DR: This article divides the concepts and essential techniques necessary for realizing the Metaverse into three components (i.e., hardware, software, and contents) and three approaches, rather than taking a marketing or hardware approach, to conduct a comprehensive analysis. It then relates these components and techniques to representative Metaverse examples (Ready Player One, Roblox, and Facebook research) in the domains of films, games, and studies.
Journal Article

GIT: A Generative Image-to-text Transformer for Vision and Language

TL;DR: This paper designs and trains GIT, a generative image-to-text transformer that unifies vision-language tasks such as image/video captioning and question answering, and presents a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.
Posted Content

Structured Multimodal Attentions for TextVQA

TL;DR: This paper proposes an end-to-end structured multimodal attention (SMA) neural network for TextVQA.
Proceedings Article

A Multiplexed Network for End-to-End, Multilingual OCR

TL;DR: This paper proposes an end-to-end training pipeline that includes both detection and recognition, and achieves state-of-the-art results on both text detection and script identification benchmarks.