Proceedings ArticleDOI

A Simple and Effective Solution for Script Identification in the Wild

TL;DR: This work presents an approach for automatically identifying the script of text localized in scene images using an off-the-shelf classifier; the method is efficient and requires very little labeled data.
Abstract: We present an approach for automatically identifying the script of text localized in scene images. Our approach is inspired by advancements in mid-level features. We represent text images using mid-level features pooled from densely computed local features. Once text images are represented using the proposed mid-level feature representation, we use an off-the-shelf classifier to identify the script of the text image. Our approach is efficient and requires very little labeled data. We evaluate the performance of our method on the recently introduced CVSI dataset, demonstrating that the proposed approach can correctly identify the script of 96.70% of the text images. In addition, we introduce and benchmark a more challenging Indian Language Scene Text (ILST) dataset for evaluating the performance of our method.
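The pipeline in the abstract (densely computed local features, pooled into a mid-level representation, fed to an off-the-shelf classifier) can be sketched roughly as follows. The raw-patch descriptors, tiny k-means codebook, and exponential soft assignment here are illustrative stand-ins, not the authors' exact features:

```python
import numpy as np

def dense_local_features(image, patch=8, stride=4):
    """Densely sampled, L2-normalized raw-patch descriptors (a stand-in for
    SIFT-like local features)."""
    H, W = image.shape
    feats = []
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            p = image[y:y + patch, x:x + patch].ravel().astype(float)
            n = np.linalg.norm(p)
            feats.append(p / n if n > 0 else p)
    return np.array(feats)

def build_codebook(descriptors, k=16, iters=10, seed=0):
    """Tiny k-means codebook playing the role of the mid-level dictionary."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].copy()
    for _ in range(iters):
        d = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            members = descriptors[assign == j]
            if len(members):
                centers[j] = members.mean(0)
    return centers

def pooled_representation(image, centers):
    """Encode local features against the codebook and max-pool over the image,
    giving one fixed-length vector to hand to any off-the-shelf classifier."""
    feats = dense_local_features(image)
    d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    sim = np.exp(-d)        # soft assignment to codewords
    return sim.max(axis=0)  # max pooling
```

The resulting vector could then be fed to any standard linear classifier.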
Citations
Proceedings ArticleDOI
01 Nov 2017
TL;DR: This paper presents the dataset, the tasks and the findings of this RRC-MLT challenge, which aims at assessing the ability of state-of-the-art methods to detect Multi-Lingual Text in scene images, such as in contents gathered from the Internet media and in modern cities where multiple cultures live and communicate together.
Abstract: Text detection and recognition in a natural environment are key components of many applications, ranging from business card digitization to shop indexation in a street. This competition aims at assessing the ability of state-of-the-art methods to detect Multi-Lingual Text (MLT) in scene images, such as in contents gathered from the Internet media and in modern cities where multiple cultures live and communicate together. This competition is an extension of the Robust Reading Competition (RRC) which has been held since 2003 both in ICDAR and in an online context. The proposed competition is presented as a new challenge of the RRC. The dataset built for this challenge largely extends the previous RRC editions in many aspects: the multi-lingual text, the size of the dataset, the multi-oriented text, the wide variety of scenes. The dataset is comprised of 18,000 images which contain text belonging to 9 languages. The challenge is comprised of three tasks related to text detection and script classification. We have received a total of 16 participations from the research and industrial communities. This paper presents the dataset, the tasks and the findings of this RRC-MLT challenge.

321 citations


Cites background from "A Simple and Effective Solution for..."

  • ...The previous editions of RRC competitions [1], [2] and other works [3], [4], [5], [6], [7], have provided useful datasets to help researchers tackle each of those problems in order to robustly read text in natural scene images....

  • ...Despite the available datasets related to scene text detection or to script identification [2], [3], [4], [5], [6], [7], our dataset offers interesting novel aspects....

Proceedings ArticleDOI
01 Sep 2019
TL;DR: The RRC-MLT-2019 challenge extends RRC-MLT-2017 for multi-lingual scene text (MLT) detection and recognition, aiming to systematically benchmark the field and push the state-of-the-art forward.
Abstract: With the growing cosmopolitan culture of modern cities, the need of robust Multi-Lingual scene Text (MLT) detection and recognition systems has never been more immense. With the goal to systematically benchmark and push the state-of-the-art forward, the proposed competition builds on top of the RRC-MLT-2017 with an additional end-to-end task, an additional language in the real images dataset, a large scale multi-lingual synthetic dataset to assist the training, and a baseline End-to-End recognition method. The real dataset consists of 20,000 images containing text from 10 languages. The challenge has 4 tasks covering various aspects of multi-lingual scene text: (a) text detection, (b) cropped word script classification, (c) joint text detection and script classification and (d) end-to-end detection and recognition. In total, the competition received 60 submissions from the research and industrial communities. This paper presents the dataset, the tasks and the findings of the presented RRC-MLT-2019 challenge.

175 citations

Journal ArticleDOI
TL;DR: A novel method is proposed that extracts local and global features with a CNN-LSTM framework and weights them dynamically for script identification, achieving superior results compared to conventional methods.

110 citations

Posted Content
TL;DR: This literature review attempts to present the entire picture of the field of scene text recognition, providing a comprehensive reference for people entering the field and inspiration for future research.
Abstract: The history of text can be traced back over thousands of years. Rich and precise semantic information carried by text is important in a wide range of vision-based application scenarios. Therefore, text recognition in natural scenes has been an active research field in computer vision and pattern recognition. In recent years, with the rise and development of deep learning, numerous methods have shown promise in terms of innovation, practicality, and efficiency. This paper aims to (1) summarize the fundamental problems and the state-of-the-art associated with scene text recognition; (2) introduce new insights and ideas; (3) provide a comprehensive review of publicly available resources; (4) point out directions for future work. In summary, this literature review attempts to present the entire picture of the field of scene text recognition. It provides a comprehensive reference for people entering this field, and could be helpful to inspire future research. Related resources are available at our Github repository: this https URL.

72 citations


Cites background from "A Simple and Effective Solution for..."

  • ...Script identification can be interpreted as an image classification problem, where discriminative representations are usually designed, such as mid-level features [81], [82], convolutional features [83], [84], [85], and stroke-parts representations [86]....

Journal ArticleDOI
TL;DR: A novel framework integrating Local CNN and Global CNN both of which are based on ResNet-20 for script identification is presented, which fully exploits the local features of the image, effectively revealing subtle differences among the scripts that are difficult to distinguish.
Abstract: Script identification in natural scene images is a key pre-step for text recognition and is also an indispensable condition for automatic text understanding systems that are designed for multi-language environments. In this paper, we present a novel framework integrating Local CNN and Global CNN both of which are based on ResNet-20 for script identification. We first obtain a lot of patches and segmented images based on the aspect ratios of the images. Subsequently, these patches and segmented images are used as inputs to Local CNN and Global CNN for training, respectively. Finally, to get the final results, the Adaboost algorithm is used to combine the results of Local CNN and Global CNN for decision-level fusion. Benefiting from such a strategy, Local CNN fully exploits the local features of the image, effectively revealing subtle differences among the scripts that are difficult to distinguish such as English, Greek, and Russian. Moreover, Global CNN mines the global features of the image to improve the accuracy of script identification. The experimental results demonstrate that our approach has a good performance on four public datasets.
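The decision-level fusion step described above can be illustrated with the standard AdaBoost weighting formula, where each branch's weight grows as its training error shrinks. The score vectors and error values below are hypothetical; the actual branches in the paper are ResNet-20 networks:

```python
import numpy as np

def adaboost_weight(error):
    """AdaBoost-style weight for one base classifier from its training error:
    w = 0.5 * ln((1 - err) / err), clipped away from 0 and 1 for stability."""
    error = np.clip(error, 1e-6, 1 - 1e-6)
    return 0.5 * np.log((1 - error) / error)

def fuse(local_scores, global_scores, local_error, global_error):
    """Decision-level fusion: weighted sum of per-class score vectors from the
    local and global branches, then argmax for the predicted script class."""
    wl = adaboost_weight(local_error)
    wg = adaboost_weight(global_error)
    s = wl * np.asarray(local_scores) + wg * np.asarray(global_scores)
    return int(np.argmax(s))
```

With these weights, the more reliable branch dominates the final decision.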

43 citations

References
Journal ArticleDOI
TL;DR: A generalized gray-scale and rotation-invariant operator is derived that detects "uniform" patterns for any quantization of the angular space and any spatial resolution, together with a method for combining multiple operators for multiresolution analysis.
Abstract: Presents a theoretically very simple, yet efficient, multiresolution approach to gray-scale and rotation invariant texture classification based on local binary patterns and nonparametric discrimination of sample and prototype distributions. The method is based on recognizing that certain local binary patterns, termed "uniform," are fundamental properties of local image texture and their occurrence histogram is proven to be a very powerful texture feature. We derive a generalized gray-scale and rotation invariant operator presentation that allows for detecting the "uniform" patterns for any quantization of the angular space and for any spatial resolution and present a method for combining multiple operators for multiresolution analysis. The proposed approach is very robust in terms of gray-scale variations since the operator is, by definition, invariant against any monotonic transformation of the gray scale. Another advantage is computational simplicity as the operator can be realized with a few operations in a small neighborhood and a lookup table. Experimental results demonstrate that good discrimination can be achieved with the occurrence statistics of simple rotation invariant local binary patterns.
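A minimal sketch of the basic 8-neighbour LBP with the "uniform" pattern histogram. The operator in the paper additionally supports arbitrary angular quantization, circular interpolation, and rotation invariance, which this toy version omits:

```python
import numpy as np

def lbp8(image):
    """Basic 8-neighbour LBP codes for interior pixels: each neighbour
    contributes one bit, set when it is >= the center pixel."""
    c = image[1:-1, 1:-1]
    neighbors = [image[0:-2, 0:-2], image[0:-2, 1:-1], image[0:-2, 2:],
                 image[1:-1, 2:],   image[2:, 2:],     image[2:, 1:-1],
                 image[2:, 0:-2],   image[1:-1, 0:-2]]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, n in enumerate(neighbors):
        code |= ((n >= c).astype(np.uint8) << bit)
    return code

def is_uniform(code):
    """True if the circular 8-bit pattern has at most two 0/1 transitions."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8)) <= 2

def lbp_histogram(image):
    """Normalized histogram over the 58 uniform patterns, plus one shared bin
    for all non-uniform codes (the usual 59-bin descriptor for P=8)."""
    codes = lbp8(image).ravel()
    uniform_codes = [c for c in range(256) if is_uniform(c)]
    index = {c: i for i, c in enumerate(uniform_codes)}
    hist = np.zeros(len(uniform_codes) + 1)
    for c in codes:
        hist[index.get(int(c), len(uniform_codes))] += 1
    return hist / hist.sum()
```

Since the codes depend only on orderings of pixel values, the descriptor is invariant to any monotonic gray-scale transformation, as the abstract notes.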

14,245 citations


"A Simple and Effective Solution for..." refers methods in this paper

  • ...We compare our methods with popular features used for script identifications in document images namely LBP [9], Gabor features [7]....

  • ...Texture based features such as Gabor filter [7], LBP [9] have been used for script identification....

  • ...67% which is significantly better than methods used in document image script identification domain such as [7, 9]....


Proceedings ArticleDOI
13 Jun 2010
TL;DR: This work seeks to establish the relative importance of each step of mid-level feature extraction through a comprehensive cross evaluation of several types of coding modules and pooling schemes and shows how to improve the best performing coding scheme by learning a supervised discriminative dictionary for sparse coding.
Abstract: Many successful models for scene or object recognition transform low-level descriptors (such as Gabor filter responses, or SIFT descriptors) into richer representations of intermediate complexity. This process can often be broken down into two steps: (1) a coding step, which performs a pointwise transformation of the descriptors into a representation better adapted to the task, and (2) a pooling step, which summarizes the coded features over larger neighborhoods. Several combinations of coding and pooling schemes have been proposed in the literature. The goal of this paper is threefold. We seek to establish the relative importance of each step of mid-level feature extraction through a comprehensive cross evaluation of several types of coding modules (hard and soft vector quantization, sparse coding) and pooling schemes (by taking the average, or the maximum), which obtains state-of-the-art performance or better on several recognition benchmarks. We show how to improve the best performing coding scheme by learning a supervised discriminative dictionary for sparse coding. We provide theoretical and empirical insight into the remarkable performance of max pooling. By teasing apart components shared by modern mid-level feature extractors, our approach aims to facilitate the design of better recognition architectures.
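The two-step decomposition in the abstract, a coding step followed by a pooling step, can be sketched with hard and soft vector quantization plus average/max pooling. This is a toy illustration of the decomposition, not the paper's evaluation code:

```python
import numpy as np

def hard_code(descriptors, centers):
    """Coding step, hard vector quantization: one-hot assignment of each
    descriptor to its nearest codeword."""
    d = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    codes = np.zeros_like(d)
    codes[np.arange(len(d)), d.argmin(1)] = 1.0
    return codes

def soft_code(descriptors, centers, beta=1.0):
    """Coding step, soft vector quantization: assignment weights decay with
    squared distance to each codeword and are normalized per descriptor."""
    d = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    e = np.exp(-beta * d)
    return e / e.sum(1, keepdims=True)

def pool(codes, mode="max"):
    """Pooling step: summarize per-descriptor codes over a neighborhood
    (here, the whole image) by the maximum or the average."""
    return codes.max(0) if mode == "max" else codes.mean(0)
```

Cross-evaluating the four combinations (hard/soft coding with average/max pooling) mirrors, in miniature, the comparison the paper carries out.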

1,177 citations


"A Simple and Effective Solution for..." refers background or methods in this paper

  • ...Mid-level features have achieved noticeable success in image classification and retrieval tasks [11, 12, 10]....

  • ...Our method is inspired by recent advancements made in mid-level features [10, 11, 12]....

Proceedings ArticleDOI
07 Sep 2009
TL;DR: A framework is presented that uses a higher order prior computed from an English dictionary to recognize a word, which may or may not be a part of the dictionary, and achieves significant improvement in word recognition accuracies without using a restricted word list.
Abstract: The problem of recognizing text in images taken in the wild has gained significant attention from the computer vision community in recent years. Contrary to recognition of printed documents, recognizing scene text is a challenging problem. We focus on the problem of recognizing text extracted from natural scene images and the web. Significant attempts have been made to address this problem in the recent past. However, many of these works benefit from the availability of strong context, which naturally limits their applicability. In this work we present a framework that uses a higher order prior computed from an English dictionary to recognize a word, which may or may not be a part of the dictionary. We show experimental results on publicly available datasets. Furthermore, we introduce a large challenging word dataset with five thousand words to evaluate various steps of our method exhaustively. The main contributions of this work are: (1) We present a framework, which incorporates higher order statistical language models to recognize words in an unconstrained manner (i.e. we overcome the need for restricted word lists, and instead use an English dictionary to compute the priors). (2) We achieve significant improvement (more than 20%) in word recognition accuracies without using a restricted word list. (3) We introduce a large word recognition dataset (at least 5 times larger than other public datasets) with character level annotation and benchmark it.
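As a toy illustration of computing a prior from a dictionary that still scores out-of-dictionary words, one can use a smoothed character-bigram model. The add-alpha smoothing, boundary markers, and vocabulary size below are simplifying assumptions, not the paper's higher-order model:

```python
import numpy as np

def bigram_counts(dictionary):
    """Character-bigram counts from a word list, with ^/$ boundary markers."""
    counts = {}
    for word in dictionary:
        w = "^" + word + "$"
        for a, b in zip(w, w[1:]):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

def bigram_log_prior(word, counts, alpha=1.0, vocab=28):
    """Add-alpha smoothed log prior of a word under the bigram model, so even
    words absent from the dictionary receive a (low) nonzero probability."""
    w = "^" + word + "$"
    score = 0.0
    for a, b in zip(w, w[1:]):
        num = counts.get((a, b), 0) + alpha
        den = sum(c for (x, _), c in counts.items() if x == a) + alpha * vocab
        score += np.log(num / den)
    return score
```

A dictionary-like candidate thus outscores an implausible letter string without either being required to appear in a restricted word list.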

789 citations


"A Simple and Effective Solution for..." refers background in this paper

  • ...Scene text understanding has gained huge attention in last decade, and several benchmark datasets has been introduced [13, 14]....

Posted Content
TL;DR: A set of discriminative patches is discovered that can serve as a fully unsupervised mid-level visual representation; the patches may correspond to parts, objects, "visual phrases", etc., but are not restricted to any one of these.
Abstract: The goal of this paper is to discover a set of discriminative patches which can serve as a fully unsupervised mid-level visual representation. The desired patches need to satisfy two requirements: 1) to be representative, they need to occur frequently enough in the visual world; 2) to be discriminative, they need to be different enough from the rest of the visual world. The patches could correspond to parts, objects, "visual phrases", etc. but are not restricted to be any one of them. We pose this as an unsupervised discriminative clustering problem on a huge dataset of image patches. We use an iterative procedure which alternates between clustering and training discriminative classifiers, while applying careful cross-validation at each step to prevent overfitting. The paper experimentally demonstrates the effectiveness of discriminative patches as an unsupervised mid-level visual representation, suggesting that it could be used in place of visual words for many tasks. Furthermore, discriminative patches can also be used in a supervised regime, such as scene classification, where they demonstrate state-of-the-art performance on the MIT Indoor-67 dataset.
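The alternation between clustering and discriminative scoring can be caricatured as follows. This toy version replaces the paper's trained discriminative classifiers and cross-validation with a nearest-centroid separation score against a "rest of the visual world" sample, so it sketches the idea rather than the method:

```python
import numpy as np

def discover_patch_clusters(patches, world, k=3, iters=5, seed=0):
    """Toy discriminative-patch discovery: k-means finds clusters that occur
    often ('representative'); each cluster is then ranked by how much farther
    world samples sit from its center than its own members ('discriminative')."""
    rng = np.random.default_rng(seed)
    centers = patches[rng.choice(len(patches), k, replace=False)].copy()
    assign = np.zeros(len(patches), dtype=int)
    for _ in range(iters):
        d = ((patches[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            members = patches[assign == j]
            if len(members):
                centers[j] = members.mean(0)
    scores = np.empty(k)
    for j in range(k):
        members = patches[assign == j]
        own = ((members - centers[j]) ** 2).sum(-1).mean() if len(members) else np.inf
        other = ((world - centers[j]) ** 2).sum(-1).mean()
        scores[j] = other - own  # large = tight cluster, far from the world
    order = np.argsort(-scores)
    return centers[order], scores[order]
```

The paper's iterative procedure instead trains a discriminative classifier per cluster and re-assigns patches by classifier score, with cross-validation at each step to prevent overfitting.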

539 citations

Proceedings Article
01 Feb 2009
TL;DR: It is demonstrated that the performance of the proposed method can be far superior to that of commercial OCR systems, and that the method can benefit from synthetically generated training data, obviating the need for expensive data collection and annotation.
Abstract: This paper tackles the problem of recognizing characters in images of natural scenes. In particular, we focus on recognizing characters in situations that would traditionally not be handled well by OCR techniques. We present an annotated database of images containing English and Kannada characters. The database comprises images of street scenes taken in Bangalore, India using a standard camera. The problem is addressed in an object categorization framework based on a bag-of-visual-words representation. We assess the performance of various features based on nearest neighbour and SVM classification. It is demonstrated that the performance of the proposed method, using as few as 15 training images, can be far superior to that of commercial OCR systems. Furthermore, the method can benefit from synthetically generated training data obviating the need for expensive data collection and annotation.
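A minimal bag-of-visual-words descriptor with 1-NN classification, matching the representation the abstract describes (the helper names, tiny vectors, and labels below are illustrative):

```python
import numpy as np

def bow_histogram(assignments, k):
    """Bag-of-visual-words descriptor: normalized histogram of codeword
    assignments for one image (assumes at least one descriptor)."""
    h = np.bincount(assignments, minlength=k).astype(float)
    return h / h.sum()

def nearest_neighbour_label(query, train_hists, labels):
    """1-NN classification of a query histogram by squared Euclidean distance
    (the abstract also reports results with an SVM on the same histograms)."""
    d = ((np.asarray(train_hists) - np.asarray(query)) ** 2).sum(1)
    return labels[int(d.argmin())]
```

Each character image becomes one histogram, and classification reduces to comparing histograms.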

520 citations


"A Simple and Effective Solution for..." refers background in this paper

  • Dataset statistics quoted from the paper (ILST):

        Language    # scene images   # word images   Mode of collection
        Hindi       76               514             Authors, Google Images
        Malayalam   121              515             Authors, Google Images
        Kannada     115              534             Char74K [16]
        Tamil       59               563             Authors
        Telugu      79               510             Authors
        English     128              850             Authors
        Total       578              3486            -
