Proceedings ArticleDOI

Script based text identification: a multi-level architecture

TL;DR: The proposed framework takes a top-down approach, performing page-, block/paragraph- and word-level script identification in multiple stages, utilizing texture- and shape-based information embedded in the documents at different levels for feature extraction.
Abstract: Script identification in a multi-lingual document environment has numerous applications in the field of document image analysis, such as indexing and retrieval, or as an initial step towards optical character recognition. In this paper, we propose a novel hierarchical framework for script identification in bi-lingual documents. The framework presents a top-down approach, performing page, block/paragraph and word level script identification in multiple stages. We utilize texture and shape based information embedded in the documents at different levels for feature extraction. The prediction task at different levels of the hierarchy is performed by a Support Vector Machine (SVM) and a rejection-based classifier defined using AdaBoost. Experimental evaluation of the proposed concept on document collections of Hindi/English and Bangla/English scripts has shown promising results.
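
A minimal Python sketch of such a top-down pipeline, assuming placeholder feature extractors and a probability-threshold rejection rule (both assumptions, not the authors' implementation): confident page-level SVM predictions are accepted, otherwise the decision descends to the AdaBoost word-level stage.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

def texture_features(page_img):
    # Placeholder for the paper's texture descriptors (page/block level).
    return np.resize(np.asarray(page_img, dtype=float), 64)

def shape_features(word_img):
    # Placeholder for the paper's shape descriptors (word level).
    return np.resize(np.asarray(word_img, dtype=float), 32)

class HierarchicalScriptIdentifier:
    def __init__(self, reject_threshold=0.9):
        self.page_clf = SVC(probability=True)   # page/block-level stage
        self.word_clf = AdaBoostClassifier()    # word-level stage
        self.reject_threshold = reject_threshold

    def fit(self, pages, page_labels, words, word_labels):
        self.page_clf.fit([texture_features(p) for p in pages], page_labels)
        self.word_clf.fit([shape_features(w) for w in words], word_labels)
        return self

    def predict(self, page_img, word_imgs):
        proba = self.page_clf.predict_proba([texture_features(page_img)])[0]
        if proba.max() >= self.reject_threshold:
            # Confident page-level decision: the whole page is one script.
            return self.page_clf.classes_[proba.argmax()]
        # Rejected at page level: descend and classify word by word.
        return [self.word_clf.predict([shape_features(w)])[0]
                for w in word_imgs]
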
Citations
Journal ArticleDOI
TL;DR: Various feature extraction and classification techniques associated with OSI of the Indic scripts are discussed in this survey, which is hoped to serve as a compendium not only for researchers but also for policymakers and practitioners in India.

42 citations


Cites background from "Script based text identification: a..."

  • ...12 – Architecture of the proposed work described in [34]....

  • ...[34] structural features SVM And AdaBoost Hindi, English, Bangla Printed Page level, Text line level, Word level 98....

  • ...[34] proposed a novel hierarchical framework for script identification in bi-lingual printed documents....

  • ...13 – Hierarchical classifier for word level script identification [34]....

Journal ArticleDOI
TL;DR: This paper addresses three key challenges: collection, compilation and organization of benchmark databases of images of 150 Bangla-Roman and 150 Devanagari-Roman mixed-script handwritten document pages, script-level annotation of the words in those pages, and development of a bi-script and tri-script word-level script identification module using a Modified log-Gabor filter as feature extractor.
Abstract: Handwritten document image datasets are among the basic necessities for conducting research on Optical Character Recognition (OCR) systems. In a multilingual country like India, handwritten documents often contain more than one script, leading to complex pattern analysis problems. In this paper, we highlight two such situations, in which Devanagari and Bangla, the two most widely used scripts in the Indian subcontinent, are each used alongside the Roman script in documents. We address three key challenges here: 1) collection, compilation and organization of benchmark databases of images of 150 Bangla-Roman and 150 Devanagari-Roman mixed-script handwritten document pages, 2) script-level annotation of 18931 Bangla words, 15528 Devanagari words and 10331 Roman words in those 300 document pages, and 3) development of a bi-script and tri-script word-level script identification module using a Modified log-Gabor filter as feature extractor. The technique is statistically validated using multiple classifiers, and the Multi-Layer Perceptron (MLP) classifier is found to perform best. Average word-level script identification accuracies of 92.32%, 95.30% and 93.78% are achieved using 3-fold cross validation for the Bangla-Roman, Devanagari-Roman and Bangla-Devanagari-Roman databases, respectively. Both mixed-script document databases, along with the script-level annotations and 44790 extracted word images of the three aforementioned scripts, are freely available at https://code.google.com/p/cmaterdb/ .
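
A minimal numpy sketch of a frequency-domain log-Gabor filter bank used as a word-level feature extractor. The scale and orientation counts, center frequencies, angular spread and mean/std pooling are all assumptions for illustration; the paper's "Modified log-Gabor" design may differ in detail.

import numpy as np

def log_gabor_features(word_img, n_scales=4, n_orients=6,
                       f0=0.25, sigma_ratio=0.65):
    # word_img: 2-D grayscale array for a single segmented word.
    img = np.asarray(word_img, dtype=float)
    rows, cols = img.shape
    fy = np.fft.fftshift(np.fft.fftfreq(rows))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(cols))[None, :]
    radius = np.hypot(fx, fy)
    radius[rows // 2, cols // 2] = 1.0            # avoid log(0) at DC
    theta = np.arctan2(fy, fx)
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    feats = []
    for s in range(n_scales):
        fs = f0 / (2.0 ** s)                      # center frequency per scale
        radial = np.exp(-(np.log(radius / fs) ** 2)
                        / (2 * np.log(sigma_ratio) ** 2))
        for o in range(n_orients):
            angle = o * np.pi / n_orients
            dtheta = np.arctan2(np.sin(theta - angle), np.cos(theta - angle))
            angular = np.exp(-(dtheta ** 2) / (2 * (np.pi / n_orients) ** 2))
            resp = np.fft.ifft2(np.fft.ifftshift(spectrum * radial * angular))
            feats += [np.abs(resp).mean(), np.abs(resp).std()]
    return np.array(feats)   # 48-dimensional vector, fed to an MLP classifier
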

27 citations

Journal ArticleDOI
01 Oct 2018
TL;DR: A new Hindi printed and handwritten document classification system using a support vector machine and fuzzy logic first pre-processes and then classifies textual imaged documents into predefined categories.
Abstract: In recent years, many information retrieval, character recognition, and feature extraction methodologies have been proposed for different domains in Devanagari, and especially in Hindi. Given the enormous availability of scanned data, and to advance existing Hindi automated systems beyond optical character recognition, a new Hindi printed and handwritten document classification system based on a support vector machine and fuzzy logic is introduced, which first pre-processes and then classifies textual imaged documents into predefined categories. With this concept, this article presents a feasibility study of such systems with respect to Hindi, a survey of statistical measurements of Hindi keywords obtained from different sources, and the inherent challenges found in printed and handwritten documents. Technical reviews are provided and graphically represented to compare many parameters and to examine the contents, forms and classifiers used in various existing techniques.

15 citations

Journal ArticleDOI
TL;DR: A new advanced tri-layered segmentation and bi-leveled-classifier-based Hindi printed document classification system categorizes imaged documents into pre-defined mutually exclusive categories, using SVM and fuzzy matching at the character and document classification levels, respectively.
Abstract: This article introduces a new advanced tri-layered segmentation and bi-leveled-classifier-based Hindi printed document classification system, which categorizes imaged documents into pre-defined mutually exclusive categories by using SVM and fuzzy matching at the character and document classification levels, respectively. During training, the improved, noise-free image is segmented into lines and words by profiling. The system then obtains Shirorekha Less (SL) isolated characters, along with upper, left and right modifier components, from the SL words. These components use their locations and the inter character-modifier component distance to be associated with their corresponding characters only. Further, confidence values of all characters are calculated with SVM training, and all characters are mapped to Romanized labels to generate the words. Finally, documents are classified by fuzzy matching of the Romanized detected words against predefined classes. The average execution times for SL characters are 0.22675 sec and 0.20375 sec, and the classification accuracies are 74.61% and 80.73%, for training and testing respectively.
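
A minimal sketch of the final fuzzy-matching stage: Romanized words recovered from character-level classification are matched against per-class keyword lists, and the class with the best aggregate score wins. The keyword lists, the SequenceMatcher similarity score and the scoring rule are illustrative assumptions; the abstract does not specify the paper's fuzzy matching rule.

from difflib import SequenceMatcher

CLASS_KEYWORDS = {                  # hypothetical predefined categories
    "news":    ["samachar", "khabar"],
    "invoice": ["bill", "rashi", "kul"],
}

def fuzzy_score(word, keyword):
    return SequenceMatcher(None, word, keyword).ratio()  # 0..1 similarity

def classify_document(romanized_words, threshold=0.8):
    scores = {c: 0.0 for c in CLASS_KEYWORDS}
    for word in romanized_words:
        for cls, keywords in CLASS_KEYWORDS.items():
            best = max(fuzzy_score(word.lower(), k) for k in keywords)
            if best >= threshold:                 # count confident matches
                scores[cls] += best
    return max(scores, key=scores.get)

print(classify_document(["kul", "rashi", "samachar"]))   # -> "invoice"
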

10 citations

Journal ArticleDOI
TL;DR: The evolution of bilingual NLP and image-based document classification systems is discussed, with an overview of their methods, feature extraction techniques, document sets, classifiers, and accuracy for English-Hindi and other language pairs.
Abstract: Today, rapid digitization requires efficient bilingual non-image and image document classification systems. Although many bilingual NLP and image-based systems provide solutions for real-world problems, they primarily focus on text extraction, identification, and recognition tasks over a limited range of document types. This article traces the evolution of these systems and provides an overview of their methods, feature extraction techniques, document sets, classifiers, and accuracy for English-Hindi and other language pairs. The gaps found lead toward the idea of a generic and integrated bilingual English-Hindi document classification system, which classifies heterogeneous documents using a dual class feeder and two character corpora. Its non-image and image modules include pre- and post-processing stages and pre- and post-segmentation stages to classify documents into predefined classes. The article also discusses many real-life applications to societal and commercial problems. The analytical results show important findings from existing and proposed systems.

8 citations

References
Proceedings ArticleDOI
20 Aug 2006
TL;DR: A study of script identification based on morphological reconstruction for printed document images, carried out on 609 scanned document images representing the English, Hindi, Kannada, and Urdu scripts, using a system composed of a feature extractor and a classifier.
Abstract: In this paper, a study of script identification based on morphological reconstruction for printed document images is carried out. The system is developed using 609 scanned document images representing the English, Hindi, Kannada, and Urdu scripts, and consists of a feature extractor and a classifier. The feature extractor operates in two stages. In the first stage, morphological erosion and opening by reconstruction are carried out on a document image in the horizontal, vertical, right and left diagonal directions using a line structuring element, whose length is fixed based on the average height of all the connected components of the image. In the next stage, the average pixel distribution is found in these resulting images. A nearest neighbor analysis is used to classify new documents. Classification accuracy averaged 97% across the four scripts. The method shows robustness with respect to noise, font sizes and styles.
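
A minimal scikit-image sketch of directional opening-by-reconstruction features along the lines described above. Deriving the structuring-element length from the average connected-component height is left as a plain parameter here, and the per-direction mean is an assumed reading of "average pixel distribution".

import numpy as np
from skimage.morphology import erosion, reconstruction

def line_footprint(length, direction):
    # Line structuring elements: horizontal, vertical and the two diagonals.
    if direction == "h":
        return np.ones((1, length), dtype=np.uint8)
    if direction == "v":
        return np.ones((length, 1), dtype=np.uint8)
    eye = np.eye(length, dtype=np.uint8)
    return eye if direction == "d1" else np.fliplr(eye)

def reconstruction_features(gray_img, length=9):
    # length would be set from the average connected-component height.
    img = np.asarray(gray_img, dtype=float)
    feats = []
    for d in ("h", "v", "d1", "d2"):
        eroded = erosion(img, line_footprint(length, d))
        opened = reconstruction(eroded, img, method="dilation")
        feats.append(opened.mean())  # average pixel value per direction
    return np.array(feats)           # classified with a nearest neighbor rule
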

47 citations

Journal ArticleDOI
TL;DR: An adaptive algorithm for preprocessing document images prior to binarization in character recognition problems, which uses a quadratic system model to provide edge enhancement for input images corrupted by noise and other distortions during scanning.
Abstract: This paper presents an adaptive algorithm for preprocessing document images prior to binarization in character recognition problems. Our method is similar in its approach to the blind adaptive equalization of binary communication channels. The adaptive filter utilizes a quadratic system model to provide edge enhancement for input images that have been corrupted by noise and other types of distortions during the scanning process. Experimental results demonstrating significant improvement in the quality of the binarized images over both direct binarization and a previously available preprocessing technique are also included.
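
The paper's preprocessor is an adaptive quadratic (Volterra-type) filter; as a simple non-adaptive stand-in, the sketch below applies a 2-D Teager-like quadratic operator in an unsharp-masking arrangement to boost edges before binarization. This only illustrates what a quadratic edge-enhancement model looks like and does not reproduce the paper's adaptive algorithm; the weight and normalization are assumptions.

import numpy as np

def teager_enhance(gray, weight=0.5):
    # gray: 2-D array of intensities in [0, 255].
    x = np.asarray(gray, dtype=float)
    c = x[1:-1, 1:-1]
    # Quadratic (second-order) terms over a 3x3 neighborhood.
    t = (3.0 * c * c
         - x[1:-1, :-2] * x[1:-1, 2:]        # horizontal neighbor pair
         - x[:-2, 1:-1] * x[2:, 1:-1]        # vertical neighbor pair
         - 0.5 * x[:-2, :-2] * x[2:, 2:]     # one diagonal pair
         - 0.5 * x[2:, :-2] * x[:-2, 2:])    # other diagonal pair
    out = x.copy()
    out[1:-1, 1:-1] = c + weight * t / 255.0  # unsharp-style edge boost
    return np.clip(out, 0, 255)
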

37 citations

Book ChapterDOI
08 Sep 2004
TL;DR: A robust technique is proposed for word-wise script identification in Indian doublet-form documents, using different topological and structural features to separate the words of different scripts.
Abstract: In a country like India, a single text line of most official documents contains words in two different scripts: under the two-language formula, documents are written in English and the state official language. For Optical Character Recognition (OCR) of such a document page, it is necessary to separate the words of the different scripts before feeding them to the OCRs of the individual scripts. In this paper, a robust technique is proposed for word-wise script identification in Indian doublet-form documents. First, the document is segmented into lines, and the lines are then segmented into words. Using different topological and structural features (such as the number of loops, the headline feature, water reservoir concept based features, and profile features), the words of each script are identified. The proposed scheme was tested on 24210 words from different doublets, achieving more than 97% accuracy on average.
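
A minimal sketch of one of the structural cues mentioned above, the headline (matra) feature: Devanagari and Bangla words carry a long horizontal ink run near the top of the word, which Roman words lack. The upper-third window and the 0.75 run-length threshold are assumptions for the sketch, not the paper's values.

import numpy as np

def has_headline(word_img_binary):
    # word_img_binary: 2-D array, 1 = ink, 0 = background.
    ink = np.asarray(word_img_binary)
    row_sums = ink.sum(axis=1)                    # horizontal projection
    top = row_sums[: max(1, ink.shape[0] // 3)]   # look in the upper third
    return top.max() >= 0.75 * ink.shape[1]       # long ink run -> matra

def split_scripts(word_images):
    indic = [w for w in word_images if has_headline(w)]
    roman = [w for w in word_images if not has_headline(w)]
    return indic, roman
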

24 citations

Journal Article
TL;DR: A model is proposed to identify the script type of a trilingual document printed in Kannada, Hindi and English scripts; the results are encouraging and demonstrate the efficacy of the proposed model.
Abstract: In a multi-script environment, the majority of documents may contain text information printed in more than one script/language. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the different script regions of the document. In this paper, we propose a model to identify the script type of a trilingual document printed in the Kannada, Hindi and English scripts. The distinct characteristic features of these three scripts are thoroughly studied from the nature of their top and bottom profiles, and the proposed model is trained to learn them. Experimentation involved 1500 text lines for learning and 1500 text lines for testing, with a k-nearest neighbor classifier used to classify the test samples. The results are encouraging and prove the efficacy of the proposed model: the average success rate is found to be 99.5% on a data set constructed from scanned document images.
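
A minimal sketch of top- and bottom-profile features for a text line: for each ink-bearing column, record the first and last ink row, then summarize the two profiles into a fixed-length vector for a k-nearest-neighbor classifier. The particular summary statistics are assumptions, not the paper's feature set.

import numpy as np

def profile_features(line_img_binary):
    # line_img_binary: 2-D array, 1 = ink, 0 = background, one text line.
    ink = np.asarray(line_img_binary).astype(bool)
    h, w = ink.shape
    cols = [c for c in range(w) if ink[:, c].any()]
    if not cols:
        return np.zeros(4)
    top = np.array([ink[:, c].argmax() for c in cols], float) / h
    bottom = np.array([h - 1 - ink[::-1, c].argmax() for c in cols], float) / h
    # Summary vector; sklearn's KNeighborsClassifier can consume these.
    return np.array([top.mean(), top.std(), bottom.mean(), bottom.std()])
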

21 citations

Proceedings ArticleDOI
25 Jul 2009
TL;DR: A framework for classifying text document images based on their script, which uses edge-direction-based features to capture the distribution of curvature and a recently proposed feature selection algorithm to obtain the most discriminating curvature features.
Abstract: We present a framework for classifying text document images based on their script, in the domain of Indian scripts, which exhibit high inter-script similarity. Indian scripts have characteristic curvature distributions that aid the visual discrimination of scripts. We use edge-direction-based features to capture the distribution of curvature, and a recently proposed feature selection algorithm to obtain the most discriminating curvature features. We form a hierarchy automatically, based on statistical distances between the script models: the hierarchy groups similar scripts at one level and then focuses on classification among those similar scripts at the next level, leading to improved accuracy. We show experiments and results on a large set of about 3400 images.
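
A minimal sketch of edge-direction histogram features of the kind described, assuming Sobel gradients and an 8-bin orientation histogram over strong edge pixels; the bin count and edge threshold are illustrative choices, not the paper's.

import numpy as np
from scipy import ndimage

def edge_direction_histogram(gray, bins=8, edge_thresh=30.0):
    # gray: 2-D grayscale array of a document image.
    img = np.asarray(gray, dtype=float)
    gx = ndimage.sobel(img, axis=1)
    gy = ndimage.sobel(img, axis=0)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)[mag > edge_thresh]   # keep strong edges only
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi))
    return hist / max(1, hist.sum())              # normalized distribution
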

4 citations