Journal ArticleDOI

Text detection, recognition, and script identification in natural scene images: a Review

About: This article is published in International Journal of Multimedia Information Retrieval. The article was published on 2022-07-05. It has received 7 citations to date. The article focuses on the topic: Computer science.
Citations
Proceedings ArticleDOI
19 Dec 2022
TL;DR: In this paper, an approach to automated identification of the accuracy requirements set in the detail drawing is proposed based on image analysis; a database of tolerances on linear sizes makes it possible to increase the efficiency of identifying accuracy requirements by comparing recognition results with standard values.
Abstract: The article develops an approach to automated identification of the accuracy requirements set in the detail drawing. A technique for recognizing accuracy requirements based on image analysis is proposed. The algorithm for identifying tolerances on linear sizes is based on classical text recognition algorithms. The advantage of the developed approach is its versatility: the effectiveness of recognizing tolerances on linear sizes does not depend on how text entries are set and oriented in the drawing. A database of tolerances on linear sizes has been developed, which makes it possible to increase the efficiency of identifying accuracy requirements by comparing recognition results with standard values. The structure of a convolutional neural network is proposed to identify the symbols of tolerances of form, orientation, location, run-out, and roughness. This makes it possible to determine the region of each requirement with high accuracy and to improve identification performance.
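The database-lookup step described above can be sketched as nearest-value matching: a recognized number is snapped to the closest standard tolerance, which filters out OCR misreads. The table entries, function name, and rejection threshold below are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of the tolerance-database idea: snap a recognized
# numeric value to the closest standard tolerance, or reject it if it
# is too far from every entry. Values are illustrative only.

STANDARD_TOLERANCES_MM = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]

def snap_to_standard(recognized_value, table=STANDARD_TOLERANCES_MM,
                     max_rel_error=0.25):
    """Return the closest standard tolerance, or None if the OCR value
    is too far from every entry to be trusted."""
    closest = min(table, key=lambda t: abs(t - recognized_value))
    if abs(closest - recognized_value) > max_rel_error * closest:
        return None  # likely a misrecognition; flag for manual review
    return closest

print(snap_to_standard(0.048))  # OCR read "0.048" snaps to 0.05
print(snap_to_standard(3.7))    # far from all entries, rejected
```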
Journal ArticleDOI
TL;DR: In this article, a method based on training samples was proposed to classify the original continuous text through an artificial neural network algorithm; experimental results show that good results are achieved on the IMDB comment dataset.
Abstract: How to effectively identify such signals and data has become an urgent topic. The neural network model is a stochastic system composed of nonlinear neurons and therefore has strong self-adaptability and controllability. This paper proposes a method based on training samples, which classifies the original continuous text through an artificial neural network algorithm. The paper mainly uses experimental and comparative methods to analyze accuracy, precision, recall, F-value, and their trends during training, as well as the results under different models. The experimental results show that good results are achieved on the IMDB comment dataset, with an accuracy close to 89.4%.
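The evaluation measures the abstract lists (accuracy, precision, recall, F-value) can be computed with a short, framework-free sketch for a binary classifier; the labels below are a toy example, not the paper's data.

```python
# Minimal sketch (not the paper's code) of the reported metrics:
# accuracy, precision, recall, and F-value from true vs. predicted
# binary labels (1 = positive review, 0 = negative review).

def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_value = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return accuracy, precision, recall, f_value

# Toy run on five labeled examples.
acc, prec, rec, f1 = binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(acc, prec, rec, f1)
```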
Journal ArticleDOI
TL;DR: Zhang et al. studied and analyzed a text detection and recognition system based on the dual-attention mechanism (DAM) under AIT, and discussed the text localization method, the generation of candidate text regions, and the key points of text extraction and recognition.
Abstract: With the rapid development of Internet and artificial intelligence technology (AIT), a large number of social media tools such as Facebook, Twitter and Instagram have emerged, which not only provide a wide communication platform for Internet users, but also generate tens of thousands of text information with rich emotions. In this paper, we study and analyze the text detection (TD) and recognition system (RS) based on the dual-attention mechanism (DAM) under AIT, and discuss the text localization method, the generation of candidate text regions, and the key points of text extraction and recognition; we introduce the DAM to train the model by two different types of feature maps to improve the TD and recognition performance.
Journal ArticleDOI
TL;DR: In this paper, an intelligent image recognition (IR) and analysis system for classroom behavior is established, which analyzes and records the behavior action (BA) data in the whole classroom.
Abstract: With the continuous improvement of the education recording system and artificial intelligence technology, the intelligent education system based on image recognition (IR) technology has achieved rapid development and application. The intelligent IR and analysis system for classroom behavior will analyze and record the behavior action (BA) data in the whole classroom. This work establishes an intelligent IR system using a convolutional neural network (CNN)-based image behavior action recognition algorithm. This system identifies students' behavior actions in an English flipped classroom (FC) to reflect students’ motivation for the English classroom and to gauge the allure of the teacher's lecture to the students. The study's results will provide data reference in the subsequent assessment and analysis of the quality of college English teaching.
Journal ArticleDOI
TL;DR: Zhang et al. proposed a multi-channel MSER (Maximally Stable Extreme Regions) method to fully consider color information in text detection; it separates the text area in the image from the complex background, effectively reducing the influence of complex backgrounds and light on street sign text detection.
Abstract: The street sign text information from natural scenes usually exists in a complex background environment and is affected by natural light and artificial light. However, most of the current text detection algorithms do not effectively reduce the influence of light and do not make full use of the relationship between high-level semantic information and contextual semantic information in the feature extraction network when extracting features from images, and they are ineffective at detecting text in complex backgrounds. To solve these problems, we first propose a multi-channel MSER (Maximally Stable Extreme Regions) method to fully consider color information in text detection, which separates the text area in the image from the complex background, effectively reducing the influence of the complex background and light on street sign text detection. We also propose an enhanced feature pyramid network text detection method, which includes a feature pyramid route enhancement (FPRE) module and a high-level feature enhancement (HLFE) module. The two modules can make full use of the network’s low-level and high-level semantic information to enhance the network’s effectiveness in localizing text information and detecting text with different shapes, sizes, and inclined text. Experiments showed that the F-scores obtained by the method proposed in this paper on ICDAR 2015 (International Conference on Document Analysis and Recognition 2015) dataset, ICDAR2017-MLT (International Conference on Document Analysis and Recognition 2017- Competition on Multi-lingual scene text detection) dataset, and the Natural Scene Street Signs (NSSS) dataset constructed in this study are 89.5%, 84.5%, and 73.3%, respectively, which confirmed the performance advantage of the method proposed in street sign text detection.
References
Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
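The paper's case for very small filters can be illustrated with simple receptive-field and parameter arithmetic. This is a sketch assuming stride-1 convolutions with no pooling, with the channel count chosen for illustration.

```python
# Receptive-field arithmetic for stacked stride-1 convolutions:
# each k x k layer grows the receptive field by k - 1. A stack of
# small 3x3 filters covers the same field as one large filter with
# fewer parameters.

def receptive_field(kernel_sizes):
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def conv_params(kernel_sizes, channels):
    # Parameter count for C-in/C-out layers, ignoring biases.
    return sum(k * k * channels * channels for k in kernel_sizes)

print(receptive_field([3, 3]))      # two 3x3 layers cover 5x5
print(receptive_field([3, 3, 3]))   # three 3x3 layers cover 7x7
print(conv_params([3, 3, 3], 64))   # 27 * C^2 parameters...
print(conv_params([7], 64))         # ...vs 49 * C^2 for a single 7x7
```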

55,235 citations

Journal ArticleDOI
TL;DR: A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, respectively, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully connected layers we employed a recently developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
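The "dropout" regularizer mentioned above can be sketched in a few lines. This is the standard inverted-dropout formulation rather than the authors' code: each activation is zeroed with probability p during training and the survivors are rescaled, so no change is needed at test time.

```python
# Framework-free sketch of inverted dropout: zero each activation
# with probability p at training time and rescale survivors by
# 1 / (1 - p), so the expected activation is unchanged and the
# network can be used as-is at test time.

import random

def dropout(activations, p, training=True, rng=random):
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

random.seed(0)  # fixed seed so the toy run is reproducible
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5)
print(out)  # some units zeroed, surviving units doubled
```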

33,301 citations

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
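The core HOG computation, orientation binning weighted by gradient magnitude, can be sketched for a single cell. This simplified version uses centered differences and 9 unsigned-orientation bins, and omits the block normalization the abstract highlights; it is an illustration, not the authors' implementation.

```python
# Single-cell HOG sketch: centered-difference gradients, then each
# pixel votes its gradient magnitude into one of 9 orientation bins
# over 0-180 degrees (unsigned gradients). Block normalization and
# bin interpolation are omitted for brevity.

import math

def hog_cell_histogram(patch, n_bins=9):
    h, w = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / 180.0 * n_bins) % n_bins] += magnitude
    return hist

# A patch with a pure vertical intensity edge: all gradient energy
# lands in the 0-degree (horizontal gradient) bin.
patch = [[0, 0, 10, 10]] * 4
print(hog_cell_histogram(patch))
```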

31,952 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: The key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.
Abstract: Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [20], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [3] to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.
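The skip architecture described above, fusing a deep coarse score map with a shallow fine one, can be sketched with nearest-neighbor upsampling standing in for the paper's learned (transposed-convolution) upsampling; the toy score maps are invented for illustration.

```python
# Sketch of an FCN-style skip connection: upsample a coarse,
# deep-layer score map to the resolution of a finer, shallow-layer
# map and fuse them by elementwise addition. Nearest-neighbor
# upsampling is used here in place of learned upsampling.

def upsample2x(grid):
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def fuse(coarse, fine):
    up = upsample2x(coarse)
    return [[u + f for u, f in zip(ur, fr)] for ur, fr in zip(up, fine)]

coarse = [[1, 2],
          [3, 4]]            # 2x2 deep-layer scores (toy values)
fine = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [1, 1, 0, 0],
        [1, 1, 0, 0]]        # 4x4 shallow-layer scores (toy values)
print(fuse(coarse, fine))
```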

28,225 citations

Journal ArticleDOI
TL;DR: There is a natural uncertainty principle between detection and localization performance, which are the two main goals, and with this principle a single operator shape is derived which is optimal at any scale.
Abstract: This paper describes a computational approach to edge detection. The success of the approach depends on the definition of a comprehensive set of goals for the computation of edge points. These goals must be precise enough to delimit the desired behavior of the detector while making minimal assumptions about the form of the solution. We define detection and localization criteria for a class of edges, and present mathematical forms for these criteria as functionals on the operator impulse response. A third criterion is then added to ensure that the detector has only one response to a single edge. We use the criteria in numerical optimization to derive detectors for several common image features, including step edges. On specializing the analysis to step edges, we find that there is a natural uncertainty principle between detection and localization performance, which are the two main goals. With this principle we derive a single operator shape which is optimal at any scale. The optimal detector has a simple approximate implementation in which edges are marked at maxima in gradient magnitude of a Gaussian-smoothed image. We extend this simple detector using operators of several widths to cope with different signal-to-noise ratios in the image. We present a general method, called feature synthesis, for the fine-to-coarse integration of information from operators at different scales. Finally we show that step edge detector performance improves considerably as the operator point spread function is extended along the edge.
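The "simple approximate implementation" the abstract describes, marking edges at maxima of the gradient magnitude of a Gaussian-smoothed signal, can be sketched in 1-D; the signal and parameter choices below are illustrative.

```python
# Toy 1-D version of the approximate detector: smooth with a
# Gaussian, differentiate with centered differences, and mark edges
# at local maxima of the gradient magnitude (non-maximum
# suppression).

import math

def gaussian_kernel(sigma, radius):
    k = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def smooth(signal, sigma=1.0, radius=2):
    k = gaussian_kernel(sigma, radius)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(k):
            idx = min(max(i + j - radius, 0), n - 1)  # clamp at borders
            acc += w * signal[idx]
        out.append(acc)
    return out

def edge_positions(signal):
    s = smooth(signal)
    grad = [s[i + 1] - s[i - 1] for i in range(1, len(s) - 1)]
    mags = [abs(g) for g in grad]
    # Non-maximum suppression: keep local maxima of |gradient|.
    # mags[m] corresponds to signal position m + 1.
    return [m + 1 for m in range(1, len(mags) - 1)
            if mags[m] > mags[m - 1] and mags[m] >= mags[m + 1]]

step = [0.0] * 7 + [0.5] + [1.0] * 8  # a step edge centered at index 7
print(edge_positions(step))
```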

28,073 citations