
Proceedings ArticleDOI

Script based text identification: a multi-level architecture

17 Sep 2011 - pp. 11

TL;DR: The proposed framework takes a top-down approach, performing page-, block/paragraph- and word-level script identification in multiple stages, and extracts features from the texture and shape information embedded in the documents at each level.

Abstract: Script identification in a multi-lingual document environment has numerous applications in the field of document image analysis, such as indexing and retrieval, or as an initial step towards optical character recognition. In this paper, we propose a novel hierarchical framework for script identification in bi-lingual documents. The framework presents a top-down approach, performing page, block/paragraph and word level script identification in multiple stages. We utilize texture and shape based information embedded in the documents at different levels for feature extraction. The prediction task at different levels of the hierarchy is performed by a Support Vector Machine (SVM) and a rejection-based classifier defined using AdaBoost. Experimental evaluation of the proposed concept on document collections of Hindi/English and Bangla/English scripts has shown promising results.
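As a rough sketch of how such a multi-stage pipeline can be wired together, the fragment below pairs a page-level SVM with a word-level AdaBoost classifier that rejects ambiguous words rather than force-labelling them. It assumes scikit-learn and random stand-in feature vectors; the paper's actual texture/shape features, thresholds, and classifier configuration are not reproduced here.

```python
# Illustrative two-stage script identification skeleton (not the authors'
# exact method): an SVM decides at page/block level, and an AdaBoost
# classifier with a rejection band decides at word level.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

# Stand-in features: 16-dim vectors with binary script labels (0/1).
X_page, y_page = rng.normal(size=(200, 16)), rng.integers(0, 2, 200)
X_word, y_word = rng.normal(size=(400, 16)), rng.integers(0, 2, 400)

# Stage 1: page/block-level SVM.
page_clf = SVC(kernel="rbf", probability=True).fit(X_page, y_page)

# Stage 2: word-level AdaBoost with a rejection band.
word_clf = AdaBoostClassifier(n_estimators=50).fit(X_word, y_word)

def classify_word(x, low=0.4, high=0.6):
    """Return a script label, or None to reject an ambiguous word."""
    p = word_clf.predict_proba(x.reshape(1, -1))[0, 1]
    if low < p < high:
        return None  # ambiguous: defer instead of forcing a label
    return int(p >= high)

print(page_clf.predict(X_page[:1]), classify_word(X_word[0]))
```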



Citations
Journal ArticleDOI
TL;DR: Various feature extraction and classification techniques associated with the OSI of the Indic scripts are discussed in this survey, which is intended to serve as a compendium not only for researchers, but also for policymakers and practitioners in India.
Abstract: Offline Script Identification (OSI) facilitates many important applications such as automatic archiving of multilingual documents, searching online/offline archives of document images, and the selection of a script-specific Optical Character Recognition (OCR) engine in a multilingual environment. In a multilingual country like India, a document containing text words in more than one language is a common scenario. A state-of-the-art survey of the techniques available in the area of OSI for Indic scripts would be of great aid to researchers. Hence, a sincere attempt is made in this article to discuss the advancements reported in the literature during the last few decades. Various feature extraction and classification techniques associated with the OSI of the Indic scripts are discussed in this survey. We hope that this survey will serve as a compendium not only for researchers, but also for policymakers and practitioners in India. It should also help bring together researchers working on different Indic scripts. Taking the recent developments in OSI of Indian regional scripts into consideration, this article will provide a better platform for future research activities.

42 citations


Cites background from "Script based text identification: a..."

  • Fig. 12 – Architecture of the proposed work described in [34].

  • [34]: structural features; SVM and AdaBoost; Hindi, English, Bangla; printed; page level, text-line level, word level; 98…

  • [34] proposed a novel hierarchical framework for script identification in bi-lingual printed documents.

  • Fig. 13 – Hierarchical classifier for word level script identification [34].

Journal ArticleDOI
TL;DR: This paper addresses three key challenges: collection, compilation and organization of benchmark databases of images of 150 Bangla-Roman and 150 Devanagari-Roman mixed-script handwritten document pages; script-level annotation of the words in those pages; and development of a bi-script and tri-script word-level script identification module using a Modified log-Gabor filter as feature extractor.
Abstract: Handwritten document image datasets are among the basic necessities for conducting research on developing Optical Character Recognition (OCR) systems. In a multilingual country like India, handwritten documents often contain more than one script, leading to complex pattern analysis problems. In this paper, we highlight two such situations where Devanagari and Bangla scripts, the two most widely used scripts in the Indian sub-continent, are individually used along with Roman script in documents. We address three key challenges here: 1) collection, compilation and organization of benchmark databases of images of 150 Bangla-Roman and 150 Devanagari-Roman mixed-script handwritten document pages respectively; 2) script-level annotation of 18931 Bangla words, 15528 Devanagari words and 10331 Roman words in those 300 document pages; and 3) development of a bi-script and tri-script word-level script identification module using a Modified log-Gabor filter as feature extractor. The technique is statistically validated using multiple classifiers, and the Multi-Layer Perceptron (MLP) classifier is found to perform best. Average word-level script identification accuracies of 92.32%, 95.30% and 93.78% are achieved using 3-fold cross validation for the Bangla-Roman, Devanagari-Roman and Bangla-Devanagari-Roman databases respectively. Both mixed-script document databases, along with the script-level annotations and 44790 extracted word images of the three aforementioned scripts, are freely available at https://code.google.com/p/cmaterdb/ .

21 citations
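As a hedged illustration of the log-Gabor feature extraction the entry above mentions, the sketch below builds a single radial log-Gabor filter in the frequency domain with NumPy and summarizes the magnitude response of a word image. The paper's "Modified" variant, its full filter bank, and the MLP classifier are not reproduced; the function name and all parameters are illustrative.

```python
# One radial log-Gabor filter applied in the frequency domain; the
# mean/std of the magnitude response serve as simple word-level features.
import numpy as np

def log_gabor_features(img, f0=0.1, sigma_ratio=0.65):
    """Return [mean, std] of the log-Gabor magnitude response of img."""
    rows, cols = img.shape
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    radius = np.sqrt(fx**2 + fy**2)
    radius[0, 0] = 1.0  # avoid log(0) at the DC term
    # Log-Gabor transfer function: a Gaussian on a log frequency axis.
    lg = np.exp(-(np.log(radius / f0) ** 2) / (2 * np.log(sigma_ratio) ** 2))
    lg[0, 0] = 0.0      # log-Gabor filters have no DC response
    mag = np.abs(np.fft.ifft2(np.fft.fft2(img) * lg))
    return np.array([mag.mean(), mag.std()])

word = np.random.default_rng(1).random((32, 96))  # stand-in word image
print(log_gabor_features(word))
```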

Journal ArticleDOI
01 Oct 2018
TL;DR: A new Hindi printed and handwritten document classification system using a support vector machine and fuzzy logic is introduced, which first pre-processes textual image documents and then classifies them into predefined categories.
Abstract: In recent years, many information retrieval, character recognition, and feature extraction methodologies in Devanagari, and especially in Hindi, have been proposed for different domain areas. Given the enormous availability of scanned data, and to advance existing Hindi automated systems beyond optical character recognition, a new Hindi printed and handwritten document classification system using a support vector machine and fuzzy logic is introduced. The system first pre-processes textual image documents and then classifies them into predefined categories. With this concept, the article presents a feasibility study of such systems in the context of Hindi, a survey of statistical measurements of Hindi keywords obtained from different sources, and the inherent challenges found in printed and handwritten documents. Technical reviews are provided and graphically represented to compare many parameters and to summarize the contents, forms and classifiers used in various existing techniques.

11 citations
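The pre-processing stage mentioned above is not detailed in the abstract; a minimal sketch of a typical document-image front end, assuming scikit-image, normalizes the grayscale range and binarizes with Otsu's threshold:

```python
# Typical document-image pre-processing: range normalization plus Otsu
# binarization. A sketch only; the paper's exact pipeline is unspecified.
import numpy as np
from skimage.filters import threshold_otsu

def preprocess(img):
    """Normalize a grayscale page to [0, 1] and binarize it."""
    g = (img - img.min()) / (np.ptp(img) + 1e-9)
    return (g > threshold_otsu(g)).astype(np.uint8)

page = np.random.default_rng(3).random((64, 64))  # stand-in scanned page
print(preprocess(page).mean())  # fraction of pixels above the threshold
```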

Journal ArticleDOI
TL;DR: This article proposes a bi-leveled image classification system for printed and handwritten English documents and shows experimentally that it outperforms many existing systems.
Abstract: This article proposes a bi-leveled image classification system to classify printed and handwritten English documents into mutually exclusive predefined categories. The proposed system follows the steps of preprocessing, segmentation, feature extraction, and SVM-based character classification at level 1, and word association and fuzzy matching based document classification at level 2. The system architecture and its modular structure are discussed along with the various task stages and their functionalities. Further, a case study on document classification shows the internal score computations of words and keywords with fuzzy matching. Experiments on the proposed system illustrate that it achieves promising results in a time-efficient manner, with better accuracy and less computation time for printed documents than for handwritten ones. Finally, the performance of the proposed system is compared with existing systems, and it is observed that the proposed system performs better than many of them.

Keywords: Confidence Computation, Document Image, Fuzzy Matching, Handwritten Documents, Performance Analysis, Printed Documents, SVM, Text Image Classification, Word Association

5 citations
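As a rough sketch of the level-2 idea described above, fuzzy matching of noisy OCR words against per-category keyword lists yields the kind of internal scores the article mentions. Python's difflib stands in for whatever matcher the authors used; the categories, keywords, and threshold are invented for illustration.

```python
# Score a document against category keyword lists via fuzzy matching.
from difflib import SequenceMatcher

CATEGORIES = {  # hypothetical categories and keywords
    "finance": ["invoice", "payment", "account"],
    "legal":   ["agreement", "clause", "witness"],
}

def category_score(words, keywords, threshold=0.8):
    """Fraction of keywords fuzzily matched by at least one OCR'd word."""
    hits = sum(
        any(SequenceMatcher(None, w, kw).ratio() >= threshold for w in words)
        for kw in keywords
    )
    return hits / len(keywords)

ocr_words = ["lnvoice", "paymnt", "account", "total"]  # noisy OCR output
scores = {c: category_score(ocr_words, kws) for c, kws in CATEGORIES.items()}
print(max(scores, key=scores.get), scores)  # -> finance
```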

Journal ArticleDOI
TL;DR: This article surveys bilingual NLP and image-based document classification systems, providing an overview of their methods, feature extraction techniques, document sets, classifiers, and accuracy for English-Hindi and other language pairs.
Abstract: Today, rapid digitization requires efficient bilingual non-image and image document classification systems. Although many bilingual NLP and image-based systems provide solutions for real-world problems, they primarily focus on text extraction, identification, and recognition tasks with limited document types. This article traces the development of these systems and provides an overview of their methods, feature extraction techniques, document sets, classifiers, and accuracy for English-Hindi and other language pairs. The gaps found lead toward the idea of a generic and integrated bilingual English-Hindi document classification system, which classifies heterogeneous documents using a dual class feeder and two character corpora. Its non-image and image modules include pre- and post-processing stages and pre- and post-segmentation stages to classify documents into predefined classes. The article also discusses many real-life applications to societal and commercial issues. The analytical results show important findings of existing and proposed systems.

3 citations


References
Journal ArticleDOI
Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the “Integral Image” which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.

12,467 citations

Proceedings ArticleDOI
07 Jul 2001
TL;DR: A new image representation called the “Integral Image” is introduced which allows the features used by the detector to be computed very quickly and a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions.
Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the "Integral Image" which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.

10,155 citations
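Both entries above hinge on the integral image; a minimal NumPy sketch of the trick follows: one cumulative-sum pass, after which any rectangular sum costs four lookups. Function names are illustrative.

```python
# Integral image: precompute once, then any box sum is O(1).
import numpy as np

def integral_image(img):
    """Zero-padded cumulative sums so ii[y, x] = img[:y, :x].sum()."""
    return np.pad(img.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] from four corner lookups."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(16.0).reshape(4, 4)
ii = integral_image(img)
assert box_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()
print(box_sum(ii, 1, 1, 3, 3))  # 30.0
```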

BookDOI
01 Dec 2001
TL;DR: Learning with Kernels provides an introduction to SVMs and related kernel methods, covering all of the concepts necessary to enable a reader equipped with some basic mathematical knowledge to enter the world of machine learning using theoretically well-founded yet easy-to-use kernel algorithms.
Abstract: From the Publisher: In the 1990s, a new type of learning algorithm was developed, based on results from statistical learning theory: the Support Vector Machine (SVM). This gave rise to a new class of theoretically elegant learning machines that use a central concept of SVMs, kernels, for a number of learning tasks. Kernel machines provide a modular framework that can be adapted to different tasks and domains by the choice of the kernel function and the base algorithm. They are replacing neural networks in a variety of fields, including engineering, information retrieval, and bioinformatics. Learning with Kernels provides an introduction to SVMs and related kernel methods. Although the book begins with the basics, it also includes the latest research. It provides all of the concepts necessary to enable a reader equipped with some basic mathematical knowledge to enter the world of machine learning using theoretically well-founded yet easy-to-use kernel algorithms and to understand and apply the powerful algorithms that have been developed over the last few years.

7,786 citations
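The modularity the book emphasizes, one base algorithm adapted to a task by swapping the kernel function, is easy to see with scikit-learn's SVC on synthetic data. This is a generic demonstration, not an example taken from the book.

```python
# Same SVM algorithm, three kernels, swapped by a single argument.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    print(f"{kernel:>6}: test accuracy = {clf.score(X_te, y_te):.3f}")
```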

Journal ArticleDOI
TL;DR: Rotation invariant texture features are computed based on an extension of the popular multi-channel Gabor filtering technique, and their effectiveness is tested with 300 randomly rotated samples of 15 Brodatz textures to solve a practical but hitherto mostly overlooked problem in document image processing.
Abstract: Concerns the extraction of rotation invariant texture features and the use of such features in script identification from document images. Rotation invariant texture features are computed based on an extension of the popular multi-channel Gabor filtering technique, and their effectiveness is tested with 300 randomly rotated samples of 15 Brodatz textures. These features are then used in an attempt to solve a practical but hitherto mostly overlooked problem in document image processing: the identification of the script of a machine printed document. Automatic script and language recognition is an essential front-end process for the efficient and correct use of OCR and language translation products in a multilingual environment. Six languages (Chinese, English, Greek, Russian, Persian, and Malayalam) are chosen to demonstrate the potential of such a texture-based approach in script identification.

290 citations
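A minimal sketch of the principle behind rotation-invariant texture features from multi-channel Gabor filtering, assuming scikit-image: filter at several orientations, then keep statistics that do not depend on the orientation ordering. The paper's actual extension of the Gabor technique is more involved; this shows only the generic idea.

```python
# Gabor responses at several orientations, summarized by order statistics
# that stay (approximately) unchanged when the texture is rotated.
import numpy as np
from skimage.filters import gabor

def rotation_invariant_features(img, frequency=0.2, n_orient=8):
    energies = []
    for k in range(n_orient):
        real, imag = gabor(img, frequency=frequency,
                           theta=k * np.pi / n_orient)
        energies.append(np.hypot(real, imag).mean())  # channel energy
    e = np.array(energies)
    return np.array([e.mean(), e.max(), e.std()])

texture = np.random.default_rng(2).random((64, 64))  # stand-in texture
print(rotation_invariant_features(texture))
```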