Proceedings ArticleDOI

A Hybrid Deep Neural Network for Urdu Text Recognition in Natural Images

05 Jul 2019-Vol. 2019, pp 321-325
TL;DR: A hybrid deep neural network architecture with skip connections, combining convolutional and recurrent neural networks, is proposed to recognize Urdu scene text.
Abstract: In this work, we present a benchmark and a hybrid deep neural network for Urdu text recognition in natural scene images. Recognizing text in natural scene images is a challenging task that has attracted the attention of the computer vision and pattern recognition communities. In recent years, scene text recognition has been widely studied, and state-of-the-art results are achieved by deep neural network models. However, most research targets English text, and less attention has been given to other languages. In this paper, we investigate the problem of Urdu text recognition in natural scene images. Urdu is a cursive script written from right to left, in which two or more characters are joined to form a word. Recognizing cursive text in natural images is considered an open problem due to variations in its representation. A hybrid deep neural network architecture with skip connections, combining convolutional and recurrent neural networks, is proposed to recognize Urdu scene text. We introduce a new dataset of 11,500 manually cropped Urdu word images from natural scenes and report baseline results. The network is trained on whole word images, avoiding traditional character-based classification. Data augmentation with contrast stretching and histogram equalization is used to further increase the size of the dataset. The experimental results on original and augmented word images show state-of-the-art performance of the network.
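The augmentation step named in the abstract (contrast stretching and histogram equalization) is straightforward to reproduce. Below is a minimal sketch, not the authors' code, using OpenCV and NumPy on grayscale word crops; the function names are illustrative.

    import cv2
    import numpy as np

    def contrast_stretch(gray):
        # Linearly rescale intensities to the full [0, 255] range.
        lo, hi = int(gray.min()), int(gray.max())
        if hi == lo:
            return gray.copy()
        return ((gray.astype(np.float32) - lo) * (255.0 / (hi - lo))).astype(np.uint8)

    def augment(word_image):
        # word_image: cropped grayscale word image (H x W, uint8).
        # Returns the original plus two augmented variants, enlarging
        # the effective dataset as described in the abstract.
        return [word_image,
                contrast_stretch(word_image),
                cv2.equalizeHist(word_image)]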
Citations
Journal ArticleDOI
TL;DR: A segmentation-free method based on a deep convolutional recurrent neural network is proposed to solve the problem of cursive text recognition, particularly focusing on Urdu text in natural scenes; results show that the proposed deep CRNN with shortcut connections outperforms other network architectures.
Abstract: Text recognition in natural scene images is a challenging problem in computer vision. Unlike optical character recognition (OCR), text recognition in natural scene images is more complex due to variations in text size, colors, fonts, orientations, complex backgrounds, occlusion, illumination and uneven lighting conditions. In this paper, we propose a segmentation-free method based on a deep convolutional recurrent neural network to solve the problem of cursive text recognition, particularly focusing on Urdu text in natural scenes. Compared to non-cursive scripts, Urdu text recognition is more complex due to variations in writing styles, several shapes of the same character, connected text, ligature overlapping, and stretched, diagonal and condensed text. The proposed model takes a whole word image as input, without pre-segmentation into individual characters, and transforms it into a sequence of relevant features. Our model is based on three components: a deep convolutional neural network (CNN) with shortcut connections to extract and encode the features, a recurrent neural network (RNN) to decode the convolutional features, and a connectionist temporal classification (CTC) layer to map the predicted sequences to the target labels. To further increase text recognition accuracy, we explore deeper CNN architectures such as VGG-16, VGG-19, ResNet-18 and ResNet-34 to extract more appropriate Urdu text features, and compare the recognition results. To conduct the experiments, a new large-scale benchmark dataset of cropped Urdu word images in natural scenes is developed. The experimental results show that the proposed deep CRNN with shortcut connections outperforms the other network architectures. The dataset is publicly available and can be downloaded from https://data.mendeley.com/datasets/k5fz57zd9z/1.
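For readers who want the shape of the pipeline, here is a condensed PyTorch sketch of the three components the abstract lists (CNN encoder, RNN decoder, CTC output). Layer counts, sizes, and names are illustrative assumptions, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class CRNN(nn.Module):
        def __init__(self, num_classes):  # num_classes includes the CTC blank
            super().__init__()
            # CNN encoder: collapses image height, keeps width as the time axis.
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((1, None)))          # height -> 1
            # RNN decoder over the width-wise feature sequence.
            self.rnn = nn.LSTM(256, 256, bidirectional=True, batch_first=True)
            self.fc = nn.Linear(512, num_classes)

        def forward(self, x):                  # x: (B, 1, H, W)
            f = self.cnn(x).squeeze(2)         # (B, 256, W')
            out, _ = self.rnn(f.permute(0, 2, 1))
            return self.fc(out).log_softmax(-1)

Training pairs these per-step log-probabilities with nn.CTCLoss (after permuting to its expected (T, B, C) layout), which maps the predictions onto the unsegmented target labels.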

13 citations

Journal ArticleDOI
TL;DR: In this paper, the authors propose a new Urdu-text signboard dataset with 467 ligature categories, containing over 30k images for recognition and 700 annotated base images for detection.
Abstract: Signboard detection and recognition is an important task in automated context-aware marketing. Recently, scripts such as Latin, Japanese, and Chinese have been effectively detected by several machine learning algorithms. Compared to other languages, outdoor Urdu text needs further attention in detection and recognition due to its cursive nature. Urdu detection and recognition are also difficult due to a wide variety of illuminations, low resolution, inconsistent font styles, colors, and backgrounds. To address the lack of resources for Urdu text detection in outdoor environments, we propose a new Urdu-text signboard dataset with 467 ligature categories, containing over 30k images for recognition and 700 annotated base images for detection. We also propose a three-phase methodology. In the first phase, text regions containing Urdu ligatures are detected in shop-signboard images by a faster region-based convolutional neural network (Faster R-CNN) using pre-trained CNNs such as AlexNet and VGG-16. In the second phase, the regions detected in the first phase are clustered to identify unique ligatures in the dataset. In the third phase, all detected regions are recognized by a trained 18-layer convolutional neural network. The proposed system achieves a precision of 87% and a recall of 96% for detection using the VGG-16 model. For the classification of ligatures, a recognition rate of 97.50% is achieved. Ligature recognition was also evaluated using the bilingual evaluation understudy (BLEU) metric, achieving an encouraging score of 0.96 on the newly developed Urdu-Signboard dataset.
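Phase 1 (ligature-region detection) maps directly onto off-the-shelf detection APIs. The sketch below uses torchvision's Faster R-CNN as a stand-in; the paper's backbones are AlexNet and VGG-16, whereas the stock torchvision model ships with a ResNet-50 FPN backbone, so this is an approximation of the setup, not the authors' code.

    import torch
    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    def build_detector(num_classes=2):  # background + Urdu ligature region
        model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
        # Replace the box-predictor head for the two-class signboard task.
        in_features = model.roi_heads.box_predictor.cls_score.in_features
        model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
        return model

    detector = build_detector().eval()
    with torch.no_grad():
        # Dummy 600x800 RGB input; a real input is a signboard photo in [0, 1].
        boxes = detector([torch.rand(3, 600, 800)])[0]["boxes"]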

9 citations

Journal ArticleDOI
TL;DR: A new method based on a combination of various functions and a convolutional neural network is proposed to solve the English character recognition pollution problem; it provides multi-label image properties and the best authentication result for the window color channel.

4 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
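The core reformulation is small enough to show in code. A minimal PyTorch sketch of a basic residual block, with sizes chosen for illustration rather than taken from the paper:

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        # Learns a residual function F(x); the identity shortcut adds the
        # input back, so the block outputs F(x) + x rather than an
        # unreferenced mapping.
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels))
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.body(x) + x)  # y = F(x) + x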

123,388 citations

Proceedings Article
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has low memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, which inspired Adam, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
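The update rule is compact enough to restate. A NumPy sketch of one Adam step with the commonly cited default hyper-parameters; the variable names follow the paper, but the function itself is illustrative:

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * grad        # first-moment (mean) estimate
        v = b2 * v + (1 - b2) * grad**2     # second-moment (uncentered variance)
        m_hat = m / (1 - b1**t)             # bias correction, t = step count >= 1
        v_hat = v / (1 - b2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v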

111,197 citations


"A Hybrid Deep Neural Network for Ur..." refers methods in this paper

  • ...The network is trained using Adam optimizer [33] with a learning rate of 0....

Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
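The design principle is uniform stacks of 3x3 convolutions between pooling layers; two stacked 3x3 layers cover a 5x5 receptive field with fewer parameters than a single 5x5 layer. A PyTorch sketch of the VGG-16 convolutional trunk, illustrative rather than the released model definition:

    import torch.nn as nn

    def vgg_block(in_ch, out_ch, n_convs):
        layers = []
        for i in range(n_convs):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.MaxPool2d(2))
        return nn.Sequential(*layers)

    # VGG-16's trunk: 2 + 2 + 3 + 3 + 3 = 13 conv layers (3 FC layers follow).
    features = nn.Sequential(
        vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3),
        vgg_block(256, 512, 3), vgg_block(512, 512, 3))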

55,235 citations


"A Hybrid Deep Neural Network for Ur..." refers methods in this paper

  • ...The architecture of the network is based on a VGG-16 [27]....

Book ChapterDOI
08 Oct 2016
TL;DR: This paper shows that forward and backward signals can be directly propagated from one block to any other block when identity mappings are used as the skip connections and after-addition activation.
Abstract: Deep residual networks have emerged as a family of extremely deep architectures showing compelling accuracy and nice convergence behaviors. In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings. This motivates us to propose a new residual unit, which makes training easier and improves generalization. We report improved results using a 1001-layer ResNet on CIFAR-10 (4.62 % error) and CIFAR-100, and a 200-layer ResNet on ImageNet. Code is available at: https://github.com/KaimingHe/resnet-1k-layers.
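A minimal sketch of the pre-activation residual unit the paper proposes, assuming fixed channel counts for brevity: BN and ReLU precede each convolution and the shortcut stays a pure identity, so nothing obstructs signal propagation between blocks.

    import torch.nn as nn

    class PreActBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1, bias=False))

        def forward(self, x):
            return x + self.body(x)  # identity shortcut, no post-addition activation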

7,398 citations