Book ChapterDOI

Towards Boosting the Accuracy of Non-latin Scene Text Recognition

TL;DR: In this article, the authors compare various features like the size (width and height) of the word images and word length statistics and discover that these factors are critical for the scene-text recognition systems.
Abstract: Scene-text recognition is remarkably better in Latin languages than in non-Latin languages due to several factors like multiple fonts, simplistic vocabulary statistics, updated data generation tools, and writing systems. This paper examines the possible reasons for low accuracy by comparing English datasets with non-Latin languages. We compare various features like the size (width and height) of the word images and word length statistics. Over the last decade, generating synthetic datasets with powerful deep learning techniques has tremendously improved scene-text recognition. Several controlled experiments are performed on English, by varying the number of (i) fonts used to create the synthetic data and (ii) created word images. We discover that these factors are critical for scene-text recognition systems. The English synthetic datasets utilize over 1400 fonts, while Arabic and other non-Latin datasets utilize fewer than 100 fonts for data generation. Since some of these languages are used across different regions, we gather additional fonts through a region-based search to improve the scene-text recognition models in Arabic and Devanagari. We improve the Word Recognition Rates (WRRs) on Arabic MLT-17 and MLT-19 datasets by \(24.54\%\) and \(2.32\%\) compared to previous works or baselines. We achieve WRR gains of \(7.88\%\) and \(3.72\%\) for IIIT-ILST and MLT-19 Devanagari datasets.
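The abstract reports results as Word Recognition Rate (WRR) without defining it. As a minimal sketch, assuming the common definition (the fraction of word images whose predicted transcription exactly matches the ground truth), it could be computed as below. The function name and the sample words are hypothetical and are not taken from the paper.

```python
def word_recognition_rate(predictions, ground_truths):
    """Return the fraction of words transcribed exactly right.

    Assumes exact string match per word image; the paper may apply
    additional normalisation that is not described in the abstract.
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have equal length")
    if not ground_truths:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)


# Hypothetical transcriptions for four cropped word images.
preds = ["shop", "exit", "hotel", "opne"]
truth = ["shop", "exit", "hotel", "open"]
wrr = word_recognition_rate(preds, truth)
print(f"WRR = {100 * wrr:.2f}%")  # 3 of 4 words match
```

Under this definition a single wrong character makes the whole word count as an error, which is why WRR is typically lower than character-level accuracy on the same predictions.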
Citations
Journal ArticleDOI
TL;DR: This work investigates the significant differences in Indian and Latin Scene Text Recognition (STR) systems and presents utilizing additional non-Unicode fonts with generally employed Unicode fonts to cover font diversity in such synthesizers for Indian languages.
Abstract: Reading Indian scene texts is complex due to the use of regional vocabulary, multiple fonts/scripts, and text size. This work investigates the significant differences in Indian and Latin Scene Text Recognition (STR) systems. Recent STR works rely on synthetic generators that involve diverse fonts to ensure robust reading solutions. We propose using additional non-Unicode fonts alongside the commonly employed Unicode fonts to cover font diversity in such synthesizers for Indian languages. We also perform experiments on transfer learning among six different Indian languages. Our transfer learning experiments on synthetic images with common backgrounds provide an exciting insight: Indian scripts can benefit more from each other than from the extensive English datasets. Our evaluations in real settings help us achieve significant improvements over previous methods on four Indian languages from standard datasets like IIIT-ILST, MLT-17, and the new dataset (we release) containing 440 scene images with 500 Gujarati and 2535 Tamil words. Further enriching the synthetic dataset with non-Unicode fonts and multiple augmentations helps us achieve a remarkable Word Recognition Rate gain of over 33% on the IIIT-ILST Hindi dataset. We also present the results of lexicon-based transcription approaches for all six languages.

1 citations

TL;DR: This thesis looks at all the parameters involved in the process of text recognition and determines the importance of those parameters through thorough experiments and proposes an error correction module for correcting the labels by utilizing the training data of real test datasets.
Abstract: Text recognition has been an active field in computer vision even before the beginning of the deep learning era. Due to the varied applications of recognition models, the research area has been classified into diverse categories based on the domain of the data used. Optical character recognition (OCR) is focused on scanned documents, whereas images with natural scenes and much more complex backgrounds fall into the category of scene text recognition. Scene text recognition has become an exciting area of research due to difficulties such as complex backgrounds, improper illumination, distorted images with noise, and inconsistent usage of fonts and font sizes in text that is not usually horizontally aligned. Such cases make the task of scene text recognition more complicated and challenging. In recent years, we have observed the rise of deep learning. Subsequently, there has been an incremental growth in the recognition algorithms and datasets available for training and testing purposes. This surge has caused the performance of recognizing text in natural scenes to rise above the baseline models that were previously trained using hand-crafted features. Latin texts were the center of attention in most of these works, which did not profoundly investigate the field of scene text recognition for non-Latin languages. Upon scrutiny, we observe that the performance of the current best recognition models has reached above 90% on scene text benchmark datasets. However, these recognition models do not perform as well on non-Latin languages as they do on Latin (or English) datasets. This striking difference in performance across languages is a rising concern among researchers focusing on low-resource languages, and it is indeed the motivation behind our work. Scene text recognition in low-resource non-Latin languages is difficult and challenging due to the inherently complex scripts, multiple writing systems, and various fonts and orientations.
Despite such differences, we can also achieve Latin (English) text-like performance for low-resource non-Latin languages. In this thesis, we look at all the parameters involved in the process of text recognition and determine the importance of those parameters through thorough experiments. We use synthetic data for controlled experiments where we test the parameters as mentioned earlier in an isolated fashion to effectively identify the catalysts of text recognition. We analyse the complexity of the scripts via these synthetic data experiments. We present the results of our experiments on two baseline models, CRNN and STAR-Net models, on available datasets to ensure generalisability. In addition to this, we also propose an error correction module for correcting the labels by utilizing the training data of real test datasets.
References
Proceedings Article
23 Feb 2016
TL;DR: In this paper, the authors show that training with residual connections accelerates the training of Inception networks significantly, and they also present several new streamlined architectures for both residual and non-residual Inception Networks.
Abstract: Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there is any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual and one Inception-v4, we achieve 3.08 percent top-5 error on the test set of the ImageNet classification (CLS) challenge.

6,761 citations

Journal ArticleDOI
TL;DR: This paper proposes a novel neural network architecture that integrates feature extraction, sequence modeling and transcription into a unified framework, and achieves remarkable performance in both lexicon-free and lexicon-based scene text recognition tasks.
Abstract: Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.

2,184 citations

Proceedings ArticleDOI
25 Aug 2013
TL;DR: The datasets and ground truth specification are described, the performance evaluation protocols used are detailed, and the final results are presented along with a brief summary of the participating methods.
Abstract: This report presents the final results of the ICDAR 2013 Robust Reading Competition. The competition is structured in three Challenges addressing text extraction in different application domains, namely born-digital images, real scene images and real-scene videos. The Challenges are organised around specific tasks covering text localisation, text segmentation and word recognition. The competition took place in the first quarter of 2013, and received a total of 42 submissions over the different tasks offered. This report describes the datasets and ground truth specification, details the performance evaluation protocols used and presents the final results along with a brief summary of the participating methods.

1,191 citations

Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, a Fully-Convolutional Regression Network (FCRN) was proposed to perform text detection and bounding-box regression at all locations and multiple scales in an image.
Abstract: In this paper we introduce a new method for text detection in natural images. The method comprises two contributions: First, a fast and scalable engine to generate synthetic images of text in clutter. This engine overlays synthetic text to existing background images in a natural way, accounting for the local 3D scene geometry. Second, we use the synthetic images to train a Fully-Convolutional Regression Network (FCRN) which efficiently performs text detection and bounding-box regression at all locations and multiple scales in an image. We discuss the relation of FCRN to the recently-introduced YOLO detector, as well as other end-to-end object detection systems based on deep learning. The resulting detection network significantly outperforms current methods for text detection in natural images, achieving an F-measure of 84.2% on the standard ICDAR 2013 benchmark. Furthermore, it can process 15 images per second on a GPU.

1,142 citations

Proceedings ArticleDOI
06 Nov 2011
TL;DR: While scene text recognition has generally been treated with highly domain-specific methods, the results demonstrate the suitability of applying generic computer vision methods.
Abstract: This paper focuses on the problem of word detection and recognition in natural images. The problem is significantly more challenging than reading text in scanned documents, and has only recently gained attention from the computer vision community. Sub-components of the problem, such as text detection and cropped image word recognition, have been studied in isolation [7, 4, 20]. However, what is unclear is how these recent approaches contribute to solving the end-to-end problem of word recognition. We fill this gap by constructing and evaluating two systems. The first, representing the de facto state-of-the-art, is a two stage pipeline consisting of text detection followed by a leading OCR engine. The second is a system rooted in generic object recognition, an extension of our previous work in [20]. We show that the latter approach achieves superior performance. While scene text recognition has generally been treated with highly domain-specific methods, our results demonstrate the suitability of applying generic computer vision methods. Adopting this approach opens the door for real world scene text recognition to benefit from the rapid advances that have been taking place in object recognition.

1,074 citations