Recognizing Text with Perspective Distortion in Natural Scenes

doi:10.1109/ICCV.2013.76

Proceedings Article•DOI•

Trung Quy Phan¹, Palaiahnakote Shivakumara², Shangxuan Tian¹, Chew Lim Tan¹

National University of Singapore¹, University of Malaya²

01 Dec 2013-pp 569-576

TL;DR: This paper presents a method for recognizing perspective scene text of arbitrary orientations that significantly outperforms the state of the art on such text, and introduces a new dataset, StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints.

Abstract: This paper presents an approach to text recognition in natural scene images. Unlike most existing works, which assume that texts are horizontal and fronto-parallel to the image plane, our method is able to recognize perspective texts of arbitrary orientations. For individual character recognition, we adopt a bag-of-keypoints approach, in which Scale Invariant Feature Transform (SIFT) descriptors are extracted densely and quantized using a pre-trained vocabulary. Following [1, 2], the context information is utilized through lexicons. We formulate word recognition as finding the optimal alignment between the set of characters and the list of lexicon words. Furthermore, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints. Experimental results on public datasets and the proposed dataset show that our method significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations.
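
The abstract frames word recognition as an alignment between detected characters and lexicon words. The sketch below illustrates that idea with a simple dynamic program; it is not the authors' formulation. The per-detection score dictionaries, the `skip_penalty` value, and the function names `alignment_score` and `recognize_word` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

NEG_INF = float("-inf")


def alignment_score(detections: List[Dict[str, float]], word: str,
                    skip_penalty: float = 0.1) -> float:
    """Best score for matching the letters of `word`, in order, to detections.

    `detections` is a left-to-right list of hypothetical character candidates,
    each mapping a character to a classifier score. Every letter must be
    matched to exactly one detection; a detection may be skipped (treated as
    a false positive) at the cost of `skip_penalty`.
    """
    n, m = len(detections), len(word)
    # best[i][j]: best score using the first i detections for the first j letters
    best = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] - skip_penalty
        for j in range(1, m + 1):
            skip = best[i - 1][j] - skip_penalty
            match = best[i - 1][j - 1] + detections[i - 1].get(word[j - 1], 0.0)
            best[i][j] = max(skip, match)
    return best[n][m]


def recognize_word(detections: List[Dict[str, float]],
                   lexicon: List[str]) -> Tuple[str, float]:
    """Return the lexicon word with the highest alignment score."""
    scored = [(word, alignment_score(detections, word)) for word in lexicon]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Made-up scores: five candidates, the third being a spurious detection.
    example = [
        {"c": 0.9, "e": 0.1},
        {"a": 0.8, "o": 0.2},
        {"l": 0.2},
        {"f": 0.7, "t": 0.3},
        {"e": 0.9, "c": 0.1},
    ]
    print(recognize_word(example, ["cafe", "cake", "late", "coffee"]))
```

Restricting the search to a small lexicon keeps the problem tractable: each word needs one pass over a table of size (detections × letters), and a word longer than the number of detections is ruled out automatically.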

Citations

Journal Article•DOI•

Text Detection and Recognition in Imagery: A Survey

Qixiang Ye, David Doermann¹

University of Maryland, College Park¹

01 Jul 2015-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This review compares and analyzes the remaining problems in text detection and recognition, summarizes the fundamental problems, and enumerates factors that should be considered when addressing them.

Abstract: This paper analyzes, compares, and contrasts technical challenges, methods, and the performance of text detection and recognition research in color imagery. It summarizes the fundamental problems and enumerates factors that should be considered when addressing these problems. Existing techniques are categorized as either stepwise or integrated, and sub-problems are highlighted, including text localization, verification, segmentation and recognition. Special issues associated with the enhancement of degraded text and the processing of video text, multi-oriented, perspectively distorted and multilingual text are also addressed. The categories and sub-categories of text are illustrated, benchmark datasets are enumerated, and the performance of the most representative approaches is compared. This review provides a fundamental comparison and analysis of the remaining problems in the field.

709 citations

Additional excerpts

...[191], Phan et al....

Proceedings Article•DOI•

Robust Scene Text Recognition with Automatic Rectification

Baoguang Shi¹, Xinggang Wang¹, Pengyuan Lyu¹, Cong Yao¹, Xiang Bai¹

Huazhong University of Science and Technology¹

12 Mar 2016

TL;DR: This paper proposes RARE (Robust text recognizer with Automatic REctification), a recognition model that consists of a Spatial Transformer Network (STN) and a Sequence Recognition Network (SRN).

Abstract: Recognizing text in natural images is a challenging task with many unsolved problems. Different from those in documents, words in natural images often possess irregular shapes, which are caused by perspective distortion, curved character placement, etc. We propose RARE (Robust text recognizer with Automatic REctification), a recognition model that is robust to irregular text. RARE is a specially designed deep neural network, which consists of a Spatial Transformer Network (STN) and a Sequence Recognition Network (SRN). In testing, an image is first rectified via a predicted Thin-Plate-Spline (TPS) transformation into a more "readable" image for the following SRN, which recognizes text through a sequence recognition approach. We show that the model is able to recognize several types of irregular text, including perspective text and curved text. RARE is end-to-end trainable, requiring only images and associated text labels, making it convenient to train and deploy the model in practical systems. State-of-the-art or highly competitive performance achieved on several benchmarks demonstrates the effectiveness of the proposed model.

606 citations

Journal Article•DOI•

ASTER: An Attentional Scene Text Recognizer with Flexible Rectification

Baoguang Shi¹, Mingkun Yang¹, Xinggang Wang¹, Pengyuan Lyu¹, Cong Yao, Xiang Bai¹

Huazhong University of Science and Technology¹

01 Sep 2019-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This work introduces ASTER, an end-to-end neural network model that comprises a rectification network and a recognition network that predicts a character sequence directly from the rectified image.

Abstract: A challenging aspect of scene text recognition is to handle text with distortions or irregular layout. In particular, perspective text and curved text are common in natural scenes and are difficult to recognize. In this work, we introduce ASTER, an end-to-end neural network model that comprises a rectification network and a recognition network. The rectification network adaptively transforms an input image into a new one, rectifying the text in it. It is powered by a flexible Thin-Plate Spline transformation which handles a variety of text irregularities and is trained without human annotations. The recognition network is an attentional sequence-to-sequence model that predicts a character sequence directly from the rectified image. The whole model is trained end to end, requiring only images and their groundtruth text. Through extensive experiments, we verify the effectiveness of the rectification and demonstrate the state-of-the-art recognition performance of ASTER. Furthermore, we demonstrate that ASTER is a powerful component in end-to-end recognition systems, for its ability to enhance the detector.

592 citations

Cites background from "Recognizing Text with Perspective D..."

...SVT-Perspective (SVTP) is proposed in [49] for evaluating the performance of recognizing perspective text....
[...]
...1, typical cases include oriented text, perspective text [49], and curved text....

Journal Article•DOI•

Scene text detection and recognition: recent advances and future trends

Yingying Zhu¹, Cong Yao¹, Xiang Bai¹

Huazhong University of Science and Technology¹

01 Feb 2016-Frontiers of Computer Science

TL;DR: This survey of scene text detection and recognition introduces up-to-date works, identifies state-of-the-art algorithms, and predicts potential research directions, serving as a reference for researchers in these areas.

Abstract: Text, as one of the most influential inventions of humanity, has played an important role in human life since ancient times. The rich and precise information embodied in text is very useful in a wide range of vision-based applications; therefore, text detection and recognition in natural scenes have become important and active research topics in computer vision and document analysis. Especially in recent years, the community has seen a surge of research efforts and substantial progress in these fields, though a variety of challenges (e.g. noise, blur, distortion, occlusion and variation) still remain. The purposes of this survey are three-fold: 1) introduce up-to-date works, 2) identify state-of-the-art algorithms, and 3) predict potential research directions in the future. Moreover, this paper provides comprehensive links to publicly available resources, including benchmark datasets, source codes, and online demos. In summary, this literature review can serve as a good reference for researchers in the areas of scene text detection and recognition.

369 citations

Posted Content•

Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes

Pengyuan Lyu¹, Minghui Liao¹, Cong Yao, Wenhao Wu, Xiang Bai¹

Huazhong University of Science and Technology¹

06 Jul 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper investigates the problem of scene text spotting, which aims at simultaneous text detection and recognition in natural images, and proposes an end-to-end trainable neural network model, named as Mask TextSpotter, which is inspired by the newly published work Mask R-CNN.

Abstract: Recently, models based on deep neural networks have dominated the fields of scene text detection and recognition. In this paper, we investigate the problem of scene text spotting, which aims at simultaneous text detection and recognition in natural images. An end-to-end trainable neural network model for scene text spotting is proposed. The proposed model, named Mask TextSpotter, is inspired by the newly published work Mask R-CNN. Different from previous methods that also accomplish text spotting with end-to-end trainable deep neural networks, Mask TextSpotter takes advantage of a simple and smooth end-to-end learning procedure, in which precise text detection and recognition are acquired via semantic segmentation. Moreover, it is superior to previous methods in handling text instances of irregular shapes, for example, curved text. Experiments on ICDAR2013, ICDAR2015 and Total-Text demonstrate that the proposed method achieves state-of-the-art results in both scene text detection and end-to-end text recognition tasks.

326 citations

Cites methods from "Recognizing Text with Perspective D..."

...are provided by the Task 4.3 of the ICDAR 2015 competition [36]. The images are taken by Google glasses incidentally. There is a large portion of oriented text in this dataset. SVT-Perspective (SVTP) [59] is similar to SVT. However, most of them are distorted by perspective transformation. The test set contains 639 cropped images. A 50-word lexicon is provided for each image. CUTE80 (CUTE) [62] consis...

References

Proceedings Article•

Visual categorization with bags of keypoints

Gabriela Csurka

01 Jan 2004

TL;DR: This paper presents a bag-of-keypoints method for generic visual categorization, based on vector quantization of affine invariant descriptors of image patches, that is simple, computationally efficient and intrinsically invariant.

Abstract: We present a novel method for generic visual categorization: the problem of identifying the object content of natural images while generalizing across variations inherent to the object class. This bag of keypoints method is based on vector quantization of affine invariant descriptors of image patches. We propose and compare two alternative implementations using different classifiers: Naive Bayes and SVM. The main advantages of the method are that it is simple, computationally efficient and intrinsically invariant. We present results for simultaneously classifying seven semantic visual categories. These results clearly demonstrate that the method is robust to background clutter and produces good categorization accuracy even without exploiting geometric information.
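
The vector quantization step at the heart of the bag-of-keypoints representation can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the vocabulary is taken as given (in practice it would be learned, for example by clustering training descriptors), the toy 2-D descriptors stand in for real high-dimensional ones, and the names `nearest_word` and `bag_of_keypoints` are hypothetical. The classifier stage (Naive Bayes or SVM) is not shown.

```python
from typing import List, Sequence


def nearest_word(descriptor: Sequence[float],
                 vocabulary: Sequence[Sequence[float]]) -> int:
    """Index of the vocabulary entry closest to `descriptor` (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for index, center in enumerate(vocabulary):
        dist = sum((d - c) ** 2 for d, c in zip(descriptor, center))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bag_of_keypoints(descriptors: Sequence[Sequence[float]],
                     vocabulary: Sequence[Sequence[float]]) -> List[float]:
    """Quantize each descriptor to a visual word and return a normalized histogram."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [count / total for count in counts]


if __name__ == "__main__":
    # Toy vocabulary with two visual words and four made-up descriptors.
    vocabulary = [[0.0, 0.0], [10.0, 10.0]]
    descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0], [11.0, 10.0]]
    print(bag_of_keypoints(descriptors, vocabulary))
```

The histogram discards where each patch was found, which is why the abstract can speak of good accuracy "even without exploiting geometric information": the feature vector passed to the classifier records only how often each visual word occurs.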

5,046 citations

Journal Article•DOI•

Robust wide-baseline stereo from maximally stable extremal regions

Jiri Matas¹, Ondrej Chum, Martin Urban, Tomas Pajdla

University of Surrey¹

01 Sep 2004-Image and Vision Computing

TL;DR: The high utility of MSERs, multiple measurement regions and the robust metric is demonstrated in wide-baseline experiments on image pairs from both indoor and outdoor scenes.

3,422 citations

Proceedings Article•DOI•

Robust wide baseline stereo from maximally stable extremal regions

Jiri Matas¹, Ondrej Chum, Martin Urban, Tomas Pajdla

University of Surrey¹

01 Jan 2002

TL;DR: The wide-baseline stereo problem, i.e. the problem of establishing correspondences between a pair of images taken from different viewpoints, is studied and an efficient and practically fast detection algorithm is presented for an affinely-invariant stable subset of extremal regions, the maximally stable extremal region (MSER).

Abstract: The wide-baseline stereo problem, i.e. the problem of establishing correspondences between a pair of images taken from different viewpoints, is studied. A new set of image elements that are put into correspondence, the so-called extremal regions, is introduced. Extremal regions possess highly desirable properties: the set is closed under (1) continuous (and thus projective) transformation of image coordinates and (2) monotonic transformation of image intensities. An efficient (near linear complexity) and practically fast detection algorithm (near frame rate) is presented for an affinely invariant stable subset of extremal regions, the maximally stable extremal regions (MSER). A new robust similarity measure for establishing tentative correspondences is proposed. The robustness ensures that invariants from multiple measurement regions (regions obtained by invariant constructions from extremal regions), some that are significantly larger (and hence discriminative) than the MSERs, may be used to establish tentative correspondences. The high utility of MSERs, multiple measurement regions and the robust metric is demonstrated in wide-baseline experiments on image pairs from both indoor and outdoor scenes. Significant change of scale (3.5×), illumination conditions, out-of-plane rotation, occlusion, locally anisotropic scale change and 3D translation of the viewpoint are all present in the test problems. Good estimates of epipolar geometry (average distance from corresponding points to the epipolar line below 0.09 of the inter-pixel distance) are obtained.

3,400 citations

Journal Article•DOI•

A Comparison of Affine Region Detectors

Krystian Mikolajczyk¹, Tinne Tuytelaars², Cordelia Schmid³, Andrew Zisserman¹, Jiri Matas⁴, Frederik Schaffalitzky¹, Timor Kadir¹, L. Van Gool²

University of Oxford¹, Katholieke Universiteit Leuven², French Institute for Research in Computer Science and Automation³, Czech Technical University in Prague⁴

01 Nov 2005-International Journal of Computer Vision

TL;DR: This paper gives a snapshot of the state of the art in affine covariant region detectors, compares their performance on a set of test images under varying imaging conditions, and aims to establish a reference test set of images and performance software so that future detectors can be evaluated in the same framework.

Abstract: The paper gives a snapshot of the state of the art in affine covariant region detectors, and compares their performance on a set of test images under varying imaging conditions. Six types of detectors are included: detectors based on affine normalization around Harris (Mikolajczyk and Schmid, 2002; Schaffalitzky and Zisserman, 2002) and Hessian points (Mikolajczyk and Schmid, 2002); a detector of 'maximally stable extremal regions', proposed by Matas et al. (2002); an edge-based region detector (Tuytelaars and Van Gool, 1999); a detector based on intensity extrema (Tuytelaars and Van Gool, 2000); and a detector of 'salient regions', proposed by Kadir, Zisserman and Brady (2004). The performance is measured against changes in viewpoint, scale, illumination, defocus and image compression. The objective of this paper is also to establish a reference test set of images and performance software, so that future detectors can be evaluated in the same framework.

3,359 citations

Journal Article•DOI•

A Novel Connectionist System for Unconstrained Handwriting Recognition

Alex Graves¹, Marcus Liwicki, Santiago Fernández², R. Bertolami, Horst Bunke, Jürgen Schmidhuber¹

Information Technology University¹, Dalle Molle Institute for Artificial Intelligence Research²

01 May 2009-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This paper proposes an alternative approach based on a novel type of recurrent neural network, specifically designed for sequence labeling tasks where the data is hard to segment and contains long-range bidirectional interdependencies, significantly outperforming a state-of-the-art HMM-based system.

Abstract: Recognizing lines of unconstrained handwritten text is a challenging task. The difficulty of segmenting cursive or overlapping characters, combined with the need to exploit surrounding context, has led to low recognition rates for even the best current recognizers. Most recent progress in the field has been made either through improved preprocessing or through advances in language modeling. Relatively little work has been done on the basic recognition algorithms. Indeed, most systems rely on the same hidden Markov models that have been used for decades in speech and handwriting recognition, despite their well-known shortcomings. This paper proposes an alternative approach based on a novel type of recurrent neural network, specifically designed for sequence labeling tasks where the data is hard to segment and contains long-range bidirectional interdependencies. In experiments on two large unconstrained handwriting databases, our approach achieves word recognition accuracies of 79.7 percent on online data and 74.1 percent on offline data, significantly outperforming a state-of-the-art HMM-based system. In addition, we demonstrate the network's robustness to lexicon size, measure the individual influence of its hidden layers, and analyze its use of context. Last, we provide an in-depth discussion of the differences between the network and HMMs, suggesting reasons for the network's superior performance.

1,686 citations

"Recognizing Text with Perspective D..." refers background in this paper

...If such words can be recognized, they can be used for a wide range of applications: content-based image retrieval, sign translation, intelligent driving assistance, and navigation aid for the visually-impaired and robots....