Improving Classical OCRs for Brahmic Scripts Using Script Grammar Learning

doi:10.1109/ICDAR.2017.363

Proceedings Article

TLDR

This paper proposes a machine learning-based scheme for converting classifier symbols to Unicode, which outperforms the existing generation scheme and improves accuracy for Devanagari and Bangla scripts.

Abstract:

Classical OCRs based on isolated character (symbol) recognition were the fundamental way of generating textual representations, particularly for Indian scripts, until transcription-based approaches gained momentum. Though the former approaches have been criticized as prone to failures, their accuracy has nevertheless been fairly decent in comparison with the newer transcription-based approaches. An analysis of isolated character recognition OCRs for Hindi and Bangla revealed that most errors arose in converting the classifier output to valid Unicode sequences, i.e., script grammar generation. Linguistic rules for generating scripts are inadequately integrated, resulting in a rigid Unicode generation scheme that is cumbersome to understand and error-prone when adapted to new Indian scripts. In this paper we propose a machine learning-based scheme for generating Unicode from classifier symbols, which outperforms the existing generation scheme and improves accuracy for Devanagari and Bangla scripts.
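To make the symbols-to-Unicode step concrete, here is a minimal sketch of one hand-written reordering rule of the kind the abstract calls rigid. It is not the paper's scheme: the function names and the single rule are assumptions chosen for illustration. The rule covers the Devanagari vowel sign I (U+093F), which is drawn to the left of its consonant cluster but stored after it in Unicode logical order.

```python
# Hypothetical illustration of one rule-based reordering step; not the paper's method.
VOWEL_SIGN_I = "\u093f"  # DEVANAGARI VOWEL SIGN I, rendered left of its consonant
VIRAMA = "\u094d"        # DEVANAGARI SIGN VIRAMA, joins consonants into a conjunct


def is_consonant(ch):
    """Devanagari consonants KA..HA occupy U+0915..U+0939."""
    return "\u0915" <= ch <= "\u0939"


def visual_to_logical(symbols):
    """Reorder left-to-right classifier symbols into Unicode logical order.

    `symbols` is assumed to be a list of single code points in the order a
    segmenter would see them on the page.
    """
    out = []
    pending_i = False
    for idx, sym in enumerate(symbols):
        if sym == VOWEL_SIGN_I:
            # Seen before its consonant on the page; hold it back.
            pending_i = True
            continue
        out.append(sym)
        nxt = symbols[idx + 1] if idx + 1 < len(symbols) else ""
        # Emit the held sign once the consonant cluster is complete,
        # i.e. the consonant is not followed by a virama.
        if pending_i and is_consonant(sym) and nxt != VIRAMA:
            out.append(VOWEL_SIGN_I)
            pending_i = False
    if pending_i:
        # No consonant followed; keep the sign rather than drop it silently.
        out.append(VOWEL_SIGN_I)
    return "".join(out)
```

Every further script feature (other pre-base signs, reph, Bangla two-part vowels) would need another rule of this kind, which suggests why a hand-coded scheme is hard to carry over to new scripts.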

Citations

An OCR for Classical Indic Documents Containing Arbitrarily Long Words

Related Papers (5)

Towards a Robust OCR System for Indic Scripts

Cross-language framework for word recognition and spotting of Indic scripts

Segmentation-based recognition system for handwritten Bangla and Devanagari words using conventional classification and transfer learning

An application of deep learning in character recognition: an overview

Lexicon and hidden Markov model-based optimisation of the recognised Sinhala script
