Open Access Proceedings Article

Focusing Attention: Towards Accurate Text Recognition in Natural Images

TLDR
This paper proposes the Focusing Attention Network (FAN), which employs a focusing attention mechanism to automatically draw back drifted attention. Existing attention-based methods perform poorly on complicated and/or low-quality images because they cannot obtain accurate alignments between feature areas and targets; FAN is designed to correct this "attention drift".
Abstract
Scene text recognition has been a hot research topic in computer vision due to its various applications. The state of the art is the attention-based encoder-decoder framework that learns the mapping between input images and output sequences in a purely data-driven way. However, we observe that existing attention-based methods perform poorly on complicated and/or low-quality images. One major reason is that existing methods cannot get accurate alignments between feature areas and targets for such images. We call this phenomenon “attention drift”. To tackle this problem, in this paper we propose the FAN (the abbreviation of Focusing Attention Network) method that employs a focusing attention mechanism to automatically draw back the drifted attention. FAN consists of two major components: an attention network (AN) that is responsible for recognizing character targets as in the existing methods, and a focusing network (FN) that is responsible for adjusting attention by evaluating whether AN pays attention properly to the target areas in the images. Furthermore, different from the existing methods, we adopt a ResNet-based network to enrich deep representations of scene text images. Extensive experiments on various benchmarks, including the IIIT5k, SVT and ICDAR datasets, show that the FAN method substantially outperforms the existing methods.
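The abstract describes a two-part architecture: an attention decoder (AN) that recognizes characters, plus a focusing network (FN) that supervises where the attention lands. The sketch below is a much-simplified, hypothetical PyTorch illustration of that idea, not the authors' implementation: all module and parameter names are assumptions, and the focusing network here simply produces extra character logits from the attention-weighted features (instead of cropping the attended region as in the paper) so that an additional loss can penalize drifted attention.

```python
# Minimal sketch of the FAN idea: an attention decoder (AN) predicts
# characters, while a focusing module (FN) re-scores the attended region
# so drifted attention can be penalized by an extra supervision signal.
# Module names, sizes and the pooling-based FN are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionDecoder(nn.Module):
    """AN: standard attention-based character decoder."""

    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=37):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)   # alignment scores
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, feats, state):
        # feats: (B, T, feat_dim) encoder features, state: (B, hidden_dim)
        expanded = state.unsqueeze(1).expand(-1, feats.size(1), -1)
        alpha = torch.softmax(
            self.score(torch.cat([feats, expanded], dim=-1)).squeeze(-1), dim=1
        )                                                  # (B, T) attention weights
        glimpse = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (B, feat_dim)
        state = self.rnn(glimpse, state)
        return self.classifier(state), alpha, state


class FocusingNetwork(nn.Module):
    """FN: predicts character logits from the attended features so that a
    focusing loss can check whether attention covered the right region."""

    def __init__(self, feat_dim=512, num_classes=37):
        super().__init__()
        self.probe = nn.Linear(feat_dim, num_classes)

    def forward(self, feats, alpha):
        attended = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # weighted pooling
        return self.probe(attended)                          # focusing logits
```

In training, the total objective would combine the recognition loss from AN with the focusing loss from FN, which is what pulls drifted attention back toward the correct character regions.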

Citations
Journal Article

ASTER: An Attentional Scene Text Recognizer with Flexible Rectification

TL;DR: This work introduces ASTER, an end-to-end neural network model that comprises a rectification network and a recognition network that predicts a character sequence directly from the rectified image.
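The rectify-then-recognize pipeline summarized above can be sketched as two composed modules. ASTER itself uses a Thin-Plate-Spline transformation; the hypothetical sketch below substitutes a simple affine transform (via PyTorch's `affine_grid`/`grid_sample`) to keep the example short, and all layer sizes and names are assumptions rather than the paper's configuration.

```python
# Simplified rectify-then-recognize sketch: a localization head predicts a
# transform, the image is resampled, and a recognizer reads the result.
# An affine transform stands in for ASTER's Thin-Plate-Spline warp.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AffineRectifier(nn.Module):
    """Predicts a 2x3 affine transform from the image and resamples it."""

    def __init__(self):
        super().__init__()
        self.localizer = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 6),
        )
        # start from the identity transform so early training is stable
        nn.init.zeros_(self.localizer[-1].weight)
        with torch.no_grad():
            self.localizer[-1].bias.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, images):                     # images: (B, 3, H, W)
        theta = self.localizer(images).view(-1, 2, 3)
        grid = F.affine_grid(theta, images.size(), align_corners=False)
        return F.grid_sample(images, grid, align_corners=False)


def recognize(images, rectifier, recognizer):
    """Rectify the image first, then predict the character sequence."""
    return recognizer(rectifier(images))
```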
Posted Content

Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes

TL;DR: This paper investigates the problem of scene text spotting, which aims at simultaneous text detection and recognition in natural images, and proposes an end-to-end trainable neural network model, named as Mask TextSpotter, which is inspired by the newly published work Mask R-CNN.
Proceedings Article

What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis

TL;DR: This paper shows that much of the apparent performance gap between scene text recognition (STR) models results from inconsistencies in the training and evaluation datasets, and introduces a unified four-stage STR framework under which existing models can be compared fairly.
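The four-stage view (transformation, feature extraction, sequence modeling, prediction) is essentially a modular pipeline in which alternative components can be swapped and compared under identical data. The skeleton below is a hypothetical illustration of that composition; the class and stage names are assumptions for this sketch, not the paper's code.

```python
# Illustrative skeleton of a four-stage scene text recognizer:
# transformation -> feature extraction -> sequence modeling -> prediction.
# Any stage can be replaced (or set to None) to compare design choices.
import torch.nn as nn


class FourStageSTR(nn.Module):
    def __init__(self, transform, extractor, sequencer, predictor):
        super().__init__()
        self.transform = transform      # e.g. None or a rectification module
        self.extractor = extractor      # e.g. a VGG / ResNet backbone
        self.sequencer = sequencer      # e.g. None or a BiLSTM
        self.predictor = predictor      # e.g. a CTC or attention decoder

    def forward(self, images):
        x = self.transform(images) if self.transform else images
        feats = self.extractor(x)                        # (B, T, C) sequence
        feats = self.sequencer(feats) if self.sequencer else feats
        return self.predictor(feats)                     # per-step class scores
```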
Proceedings Article

ESIR: End-To-End Scene Text Recognition via Iterative Image Rectification

TL;DR: This paper proposes ESIR, an end-to-end trainable scene text recognition system that iteratively removes perspective distortion and text line curvature, with the rectification driven by better text recognition performance.
Proceedings Article

AON: Towards Arbitrarily-Oriented Text Recognition

TL;DR: The arbitrary orientation network (AON) is developed to directly capture the deep features of irregular texts, which are combined with an attention-based decoder to generate character sequences; the method is also comparable to major existing methods on regular datasets.
References
Proceedings Article

Speech recognition with deep recurrent neural networks

TL;DR: This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Posted Content

ADADELTA: An Adaptive Learning Rate Method

Matthew D. Zeiler · 22 Dec 2012
TL;DR: Presents ADADELTA, a novel per-dimension learning rate method for gradient descent that dynamically adapts over time using only first-order information and has minimal computational overhead beyond vanilla stochastic gradient descent.
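The per-dimension update rule is compact enough to show directly. The NumPy sketch below follows the update described in the paper (running averages of squared gradients and squared updates in place of a global learning rate); the function name and the state-dictionary convention are assumptions of this sketch, while rho and eps are the commonly used defaults.

```python
# Minimal NumPy sketch of the ADADELTA per-dimension update rule.
import numpy as np


def adadelta_step(params, grads, state, rho=0.95, eps=1e-6):
    """Apply one ADADELTA update in place; `state` holds the running averages."""
    eg2 = state.setdefault("eg2", np.zeros_like(params))    # E[g^2]
    edx2 = state.setdefault("edx2", np.zeros_like(params))  # E[dx^2]

    eg2[:] = rho * eg2 + (1 - rho) * grads ** 2              # accumulate gradient
    delta = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grads  # per-dim step
    edx2[:] = rho * edx2 + (1 - rho) * delta ** 2            # accumulate update

    params += delta
    return params
```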
Journal Article

An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition

TL;DR: This paper proposes a novel neural network architecture that integrates feature extraction, sequence modeling and transcription into a unified framework, achieving remarkable performance in both lexicon-free and lexicon-based scene text recognition tasks.
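The unified pipeline referenced here is commonly realized as a convolutional feature extractor feeding a recurrent sequence model whose per-frame outputs are transcribed (for example with a CTC loss). The PyTorch sketch below is a rough, hypothetical illustration in that spirit; layer sizes and class names are assumptions, not the paper's exact configuration.

```python
# Rough sketch of a convolutional-recurrent recognizer: CNN features are
# collapsed into a horizontal sequence, modeled by a bidirectional LSTM,
# and scored frame by frame (suitable for CTC-style transcription).
import torch
import torch.nn as nn


class ConvRecurrentRecognizer(nn.Module):
    def __init__(self, num_classes=37, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        self.rnn = nn.LSTM(128, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, images):                 # images: (B, 1, H, W)
        feats = self.cnn(images)               # (B, 128, H/4, W/4)
        feats = feats.mean(dim=2)              # collapse height -> (B, 128, W/4)
        feats = feats.permute(0, 2, 1)         # (B, W/4, 128) feature sequence
        seq, _ = self.rnn(feats)
        return self.fc(seq)                    # per-frame class scores
```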
Proceedings Article

Attention-based models for speech recognition

TL;DR: The authors show that an attention model adapted from machine translation reaches a competitive 18.7% phoneme error rate (PER) on the TIMIT phoneme recognition task but only on utterances roughly as long as those it was trained on, and propose a location-aware attention mechanism that removes this limitation.
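Location awareness means the alignment score also looks at where attention was on the previous step, typically by convolving the previous attention weights and adding the result to the score. The sketch below is a hypothetical illustration of that scoring scheme; dimensions, kernel size and module names are assumptions rather than the paper's settings.

```python
# Sketch of location-aware attention scoring: the previous attention weights
# are convolved and fed into the alignment score, letting the model track how
# far it has advanced along the input sequence.
import torch
import torch.nn as nn


class LocationAwareAttention(nn.Module):
    def __init__(self, enc_dim=256, dec_dim=256, attn_dim=128, k=32, r=7):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.w_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.w_loc = nn.Linear(k, attn_dim, bias=False)
        self.conv = nn.Conv1d(1, k, kernel_size=2 * r + 1, padding=r)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc, dec_state, prev_alpha):
        # enc: (B, T, enc_dim), dec_state: (B, dec_dim), prev_alpha: (B, T)
        loc = self.conv(prev_alpha.unsqueeze(1)).transpose(1, 2)   # (B, T, k)
        scores = self.v(torch.tanh(
            self.w_enc(enc) + self.w_dec(dec_state).unsqueeze(1) + self.w_loc(loc)
        )).squeeze(-1)                                             # (B, T)
        alpha = torch.softmax(scores, dim=1)
        context = (alpha.unsqueeze(-1) * enc).sum(dim=1)           # (B, enc_dim)
        return context, alpha
```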