Proceedings Article

Improving Classical OCRs for Brahmic Scripts Using Script Grammar Learning

TLDR
This paper proposes a machine learning-based scheme for generating Unicode from classifier symbols, which outperforms the existing generation scheme and improves accuracy for Devanagari and Bangla scripts.
Abstract
Classical OCRs based on isolated character (symbol) recognition were the fundamental way of generating textual representations, particularly for Indian scripts, until transcription-based approaches gained momentum. Although the classical approaches have been criticized as failure-prone, their accuracy has nevertheless been fairly decent compared with the newer transcription-based approaches. An analysis of isolated character recognition OCRs for Hindi and Bangla revealed that most errors arise when converting the classifier output to valid Unicode sequences, i.e., in script grammar generation. The linguistic rules for generating the scripts are inadequately integrated, resulting in a rigid Unicode generation scheme that is cumbersome to understand and error-prone when adapted to new Indian scripts. In this paper, we propose a machine learning-based scheme for generating Unicode from classifier symbols, which outperforms the existing generation scheme and improves accuracy for Devanagari and Bangla scripts.
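To make the "classifier output to valid Unicode" step concrete, here is a minimal Python sketch of a hand-written reordering rule of the kind the abstract describes as rigid. It is not the paper's method or its baseline: the function name `visual_to_logical`, the symbol representation (each recognized symbol as a short Unicode string, half-forms as consonant plus virama), and the single rule are all assumptions made for illustration. The underlying fact is that the Devanagari vowel sign I (U+093F) is drawn to the left of its consonant cluster but stored after it in Unicode.

```python
# Hypothetical sketch: reorder left-to-right classifier symbols into
# Unicode logical order. Names and the rule set are illustrative assumptions.

VIRAMA = "\u094D"                 # DEVANAGARI SIGN VIRAMA (marks a half-form here)
PRE_BASE_MATRAS = {"\u093F"}      # DEVANAGARI VOWEL SIGN I, drawn left of the cluster


def visual_to_logical(symbols):
    """Join symbols, moving a pre-base matra after the cluster it belongs to."""
    out = []
    pending = None  # pre-base matra still waiting for the end of its cluster
    for sym in symbols:
        if sym in PRE_BASE_MATRAS:
            pending = sym
            continue
        out.append(sym)
        # A symbol ending in virama is a half-form, so the cluster continues.
        if pending is not None and not sym.endswith(VIRAMA):
            out.append(pending)
            pending = None
    if pending is not None:
        out.append(pending)  # dangling matra: keep it rather than drop it
    return "".join(out)


# Visual order: matra, KA  ->  logical order: KA, matra
simple = visual_to_logical(["\u093F", "\u0915"])
# Visual order: matra, half SA, THA  ->  logical order: SA, virama, THA, matra
conjunct = visual_to_logical(["\u093F", "\u0938\u094D", "\u0925"])
```

Even this toy case needs a special rule for half-forms, which hints at why a rule set covering a full script becomes hard to maintain and to port to another script.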


Citations
Proceedings Article

An OCR for Classical Indic Documents Containing Arbitrarily Long Words

TL;DR: A Sanskrit-specific OCR system for printed classical Indic documents is developed, and an attention-based LSTM model for reading Sanskrit characters in line images is presented, setting the stage for applying OCR to large corpora of classical Sanskrit texts containing arbitrarily long and highly conjoined words.
Trending Questions (1)
How can machine learning be used to improve the accuracy of OCR of Indus scripts?

The provided paper does not mention anything about using machine learning to improve the accuracy of OCR of Indus scripts. The paper focuses on improving OCR accuracy for Devanagari and Bangla scripts.