Script based text identification: a multi-level architecture

doi:10.1145/2034617.2034630

Proceedings Article

- pp 11

TLDR

The proposed framework takes a top-down approach, performing page-, block/paragraph-, and word-level script identification in multiple stages. Features are extracted from texture- and shape-based information embedded in the documents at each level.

Abstract:

Script identification in a multi-lingual document environment has numerous applications in the field of document image analysis, such as indexing and retrieval, or as an initial step towards optical character recognition. In this paper, we propose a novel hierarchical framework for script identification in bi-lingual documents. The framework presents a top-down approach by performing page, block/paragraph and word level script identification in multiple stages. We utilize texture and shape based information embedded in the documents at different levels for feature extraction. The prediction task at different levels of the hierarchy is performed by a Support Vector Machine (SVM) and a rejection-based classifier defined using AdaBoost. Experimental evaluation of the proposed concept on document collections of Hindi/English and Bangla/English scripts has shown promising results.
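The top-down flow described in the abstract can be illustrated with a minimal sketch. The abstract does not say how rejection works, so this sketch assumes that a prediction whose confidence falls below a threshold is rejected and the decision is deferred to the next, finer level. The names `Region`, `identify`, `toy_classifier`, and the threshold value are hypothetical, and the stand-in classifier replaces the SVM and AdaBoost models used in the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A classifier maps a feature vector to (script label, confidence in [0, 1]).
# The paper uses SVM and AdaBoost models; these are stand-ins for illustration.
Classifier = Callable[[List[float]], Tuple[str, float]]


@dataclass
class Region:
    """Hypothetical container for a page, block, or word and its sub-regions."""
    name: str
    features: List[float]
    children: List["Region"] = field(default_factory=list)


def identify(region: Region, classifiers: List[Classifier],
             threshold: float = 0.8, level: int = 0) -> Dict[str, str]:
    """Label a region top-down, descending only when a prediction is rejected."""
    label, confidence = classifiers[level](region.features)
    is_last_level = level == len(classifiers) - 1
    # Accept the label if it is confident enough or no finer level exists.
    if confidence >= threshold or is_last_level or not region.children:
        return {region.name: label}
    # Assumed rejection rule: defer the decision to the child regions.
    results: Dict[str, str] = {}
    for child in region.children:
        results.update(identify(child, classifiers, threshold, level + 1))
    return results


def toy_classifier(features: List[float]) -> Tuple[str, float]:
    """Stand-in model: features[0] is a made-up 'Hindi-likeness' score."""
    score = features[0]
    label = "Hindi" if score >= 0.5 else "English"
    return label, max(score, 1.0 - score)


page = Region("page", [0.55], [
    Region("block1", [0.95]),
    Region("block2", [0.6], [Region("word1", [0.9]), Region("word2", [0.1])]),
])
# One classifier per level: page, block/paragraph, word.
labels = identify(page, [toy_classifier] * 3)
print(labels)
```

In this toy run the mixed page and the mixed block are rejected and resolved at finer levels, while the single-script block is labelled without descending to its words.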

Citations

Journal Article

Offline Script Identification from multilingual Indic-script documents: A state-of-the-art

Pawan Kumar Singh, +2 more

- 01 Feb 2015 -

Computer Science Review

TL;DR: Various feature extraction and classification techniques associated with the OSI of the Indic scripts are discussed in this survey and it is hoped that this survey will serve as a compendium not only for researchers in India, but also for policymakers and practitioners in India.

Journal Article

Benchmark databases of handwritten Bangla-Roman and Devanagari-Roman mixed-script document images

Pawan Kumar Singh, +5 more

- 01 Apr 2018 -

Multimedia Tools and Applications

TL;DR: This paper addresses three key challenges here: collection, compilation and organization of benchmark databases of images of 150 Bangla-Roman and 150 Devanagari-Roman mixed-script handwritten document pages respectively, and development of a bi-script and tri-script word-level script identification module using Modified log-Gabor filter as feature extractor.

Journal Article

Hindi Text Document Classification System Using SVM and Fuzzy: A Survey

Shalini Puri, +1 more

TL;DR: A new idea of Hindi printed and handwritten document classification system using support vector machine and fuzzy logic first pre-processes and then classifies textual imaged documents into predefined categories.

Journal Article

A Hybrid Hindi Printed Document Classification System Using SVM and Fuzzy: An Advancement

Shalini Puri, +1 more

- 01 Oct 2019 -

Journal of Information Technology Resear...

TL;DR: A new advanced tri-layered segmentation and bi-leveled-classifier-based Hindi printed document classification system, which categorizes imaged documents into pre-defined mutually exclusive categories by using SVM and Fuzzy matching at character and document classifications, respectively.

Journal Article

Advanced Applications on Bilingual Document Analysis and Processing Systems

Shalini Puri, +1 more

- 01 Oct 2020 -

International Journal of Applied Metaheu...

TL;DR: A journey of bilingual NLP and image-based document classification systems is discussed and an overview of their methods, feature extraction techniques, document sets, classifiers, and accuracy for English-Hindi and other language pairs is provided.

References

Book Chapter

Bangla/English script identification based on analysis of connected component profiles

Lijun Zhou, +2 more

TL;DR: Experimental results demonstrate that the proposed technique is capable of identifying Bangla/English scripts on the real Bangladesh postal images.

Book Chapter

Script and language identification from document images

G. S. Peake, +1 more

TL;DR: In this paper, a uniform text block on which texture analysis can be performed is produced from a document image via simple processing using multiple channel (Gabor) filters and grey level co-occurrence matrices.

Journal Article

Neural network based system for script identification in Indian documents

S. Basavaraj Patil, +1 more

- 01 Feb 2002 -

Sadhana-academy Proceedings in Engineeri...

TL;DR: A neural network-based script identification system which can be used in the machine reading of documents written in English, Hindi and Kannada language scripts and results are very encouraging and prove the effectiveness of the approach.

Book Chapter

Script Identification in Printed Bilingual Documents

D. Dhanya, +1 more

TL;DR: Techniques to identify the script of a word using Gabor filters with suitable frequencies and orientations are discussed and results obtained are quite encouraging.

Proceedings Article

Script and Language Identification from Document Images.

G. S. Peake, +1 more

TL;DR: A new method based on texture analysis for script identification which does not require character segmentation is presented, which shows robustness with respect to noise, the presence of foreign characters or numerals, and can be applied to very small amounts of text.

Related Papers (5)

SVM Based Scheme for Thai and English Script Identification

Composite Script Identification and Orientation Detection for Indian Text Images

Zone-based structural feature extraction for script identification from Indian documents

A study on word-level multi-script identification from video frames

Script recognition in images with complex backgrounds