Improving Classical OCRs for Brahmic Scripts Using Script Grammar Learning
01 Nov 2017
TL;DR: In this paper, a machine learning-based scheme for converting classifier symbols to Unicode sequences is proposed, which outperforms the existing generation scheme and improves accuracy for Devanagari and Bangla scripts.
Abstract: Classical OCRs based on isolated character (symbol) recognition were the standard way of generating textual representations, particularly for Indian scripts, until transcription-based approaches gained momentum. Though the former approaches have been criticized as failure-prone, their accuracy has nevertheless remained fairly decent in comparison with the newer transcription-based approaches. Analysis of isolated character recognition OCRs for Hindi and Bangla revealed that most errors arise in converting the classifier output to valid Unicode sequences, i.e., in script grammar generation. Linguistic rules for generating the scripts are inadequately integrated, resulting in a rigid Unicode generation scheme that is cumbersome to understand and error-prone when adapted to new Indian scripts. In this paper we propose a machine learning-based scheme for converting classifier symbols to Unicode sequences, which outperforms the existing generation scheme and improves accuracy for Devanagari and Bangla scripts.
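To illustrate the kind of problem such a generation scheme must solve: the classifier emits glyph symbols in visual order, while Unicode requires logical order (for example, the Devanagari short-i matra is drawn before the consonant it logically follows). The sketch below is a minimal, illustrative data-driven transducer that learns glyph-run-to-Unicode rewrites from aligned examples instead of hand-written rules; the symbol names, training pairs, and greedy decoder are hypothetical simplifications, not the paper's actual model.

```python
# Minimal sketch (not the paper's model): learn classifier-symbol -> Unicode
# rewrite units from aligned examples instead of hand-coding script-grammar rules.
from collections import Counter, defaultdict

def learn_rewrites(aligned_pairs):
    """Each training pair is pre-segmented into (symbol run, Unicode text) chunks."""
    counts = defaultdict(Counter)
    for symbol_runs, texts in aligned_pairs:
        for run, text in zip(symbol_runs, texts):
            counts[tuple(run)][text] += 1
    # Keep the most frequent Unicode output for every observed symbol run.
    return {run: c.most_common(1)[0][0] for run, c in counts.items()}

def generate(symbols, rewrites, max_len=3):
    """Greedy longest-match decoding of a flat classifier symbol sequence."""
    out, i = [], 0
    while i < len(symbols):
        for n in range(min(max_len, len(symbols) - i), 0, -1):
            run = tuple(symbols[i:i + n])
            if run in rewrites:
                out.append(rewrites[run])
                i += n
                break
        else:
            i += 1  # unknown symbol: skip (a real system would back off)
    return "".join(out)

# Toy aligned data: visual order puts the short-i matra BEFORE the consonant,
# while Unicode logical order puts it AFTER (e.g. "ki" = KA + I_MATRA).
train = [
    ([["I_MATRA", "KA"]], ["\u0915\u093f"]),   # glyphs "ि" + "क" -> "कि"
    ([["KA"]], ["\u0915"]),                    # "क"
]
rules = learn_rewrites(train)
print(generate(["I_MATRA", "KA"], rules))      # -> "कि"
```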
Citations
01 Jun 2020
TL;DR: A Sanskrit-specific OCR system for printed classical Indic documents is developed, and an attention-based LSTM model for reading Sanskrit characters in line images is presented, setting the stage for applying OCR to large corpora of classical Sanskrit texts containing arbitrarily long and highly conjoined words.
Abstract: OCR for printed classical Indic documents written in Sanskrit is a challenging research problem. It involves complexities such as image degradation, lack of datasets and long-length words. Due to these challenges, the word accuracy of available OCR systems, both academic and industrial, is not very high for such documents. To address these shortcomings, we develop a Sanskrit-specific OCR system. We present an attention-based LSTM model for reading Sanskrit characters in line images. We introduce a dataset of Sanskrit document images annotated at line level. To augment real data and enable high performance for our OCR, we also generate synthetic data via curated font selection and rendering designed to incorporate crucial glyph substitution rules. Consequently, our OCR achieves a word error rate of 15.97% and a character error rate of 3.71% on challenging Indic document texts and outperforms strong baselines. Overall, our contributions set the stage for applying OCR to large corpora of classical Sanskrit texts containing arbitrarily long and highly conjoined words.
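The reported figures are the standard edit-distance metrics used throughout OCR evaluation. The following is a minimal sketch of how character and word error rates are typically computed; the example strings are placeholders, not text from the paper's dataset.

```python
# Minimal sketch of the standard character/word error rate metrics (CER/WER),
# computed as edit distance divided by reference length.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

# Example with Latin placeholders; a real evaluation would use Sanskrit line text.
print(cer("sanskrit ocr", "sanskrt ocr"))   # one deleted character
print(wer("sanskrit ocr", "sanskrt ocr"))   # one wrong word out of two
```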
15 citations
References
10 Sep 2001
TL;DR: New techniques based on fuzzy multifactorial analysis are presented for identifying and segmenting touching characters, and a predictive algorithm is developed for effectively selecting cut-points to segment them.
Abstract: The existence of touching characters in scanned documents is a major problem in designing an effective character segmentation procedure for OCR systems. In this paper, new techniques are presented for the identification and segmentation of touching characters. The techniques are based on fuzzy multifactorial analysis. A predictive algorithm is developed for effectively selecting cut-points to segment touching characters. Initially, our proposed method has been applied for segmenting touching characters that appear in Devnagari (Hindi) and Bangla, two major scripts in the Indian sub-continent. The results obtained from a test set of considerable size show that a high recognition rate can be achieved with a reasonable amount of computation.
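As a point of reference only, the sketch below shows a much simpler, commonly used baseline for locating candidate cut columns in a binarized touching-character region via vertical-projection minima. The paper's fuzzy multifactorial analysis scores cut-points with richer factors; the toy image, margins, and thresholds here are illustrative assumptions.

```python
# Simplified baseline sketch (NOT the paper's fuzzy multifactorial method):
# pick candidate cut columns in a binarized touching-character image by looking
# for vertical-projection minima away from the component's edges.
import numpy as np

def candidate_cuts(binary_img, margin_frac=0.2, max_cuts=2):
    """binary_img: 2D array, 1 = ink, 0 = background."""
    h, w = binary_img.shape
    profile = binary_img.sum(axis=0)             # ink pixels per column
    lo, hi = int(w * margin_frac), int(w * (1 - margin_frac))
    inner = np.arange(lo, hi)
    # Columns with the least ink are the most likely places where two
    # characters touch; a real system would score them with more factors
    # (stroke width, contour shape, expected character width, ...).
    order = inner[np.argsort(profile[lo:hi])]
    return sorted(order[:max_cuts].tolist())

# Toy image: two filled blocks joined by a thin bridge around column 10.
img = np.zeros((20, 21), dtype=int)
img[:, :9] = 1
img[:, 12:] = 1
img[9:11, 9:12] = 1                              # the "touching" bridge
print(candidate_cuts(img))                       # columns near the bridge
```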
8 citations
23 Aug 2015
TL;DR: Adding only a small amount of supervision drastically improves deciphering performance under optimal conditions, especially for short ciphertexts, and in complex real-life scenarios results are better than with the unsupervised baseline approach.
Abstract: In the past, unsupervised HMM training has been applied to solve letter substitution ciphers as they appear in various problems in Natural Language Processing. For some problems, parts of the cipher key can easily be provided by the user, but full manual deciphering would be too time-consuming. In this work, a semi-supervised HMM deciphering approach that uses partial ground-truth data is introduced and evaluated empirically on synthetic and real-life data for Arabic Optical Character Recognition (OCR). Adding only a small amount of supervision drastically improves deciphering performance under optimal conditions, especially for short ciphertexts. In complex real-life scenarios, results are better than with the unsupervised baseline approach.
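A minimal sketch of the semi-supervision idea, under the usual decipherment framing where hidden states are plaintext letters and observations are cipher symbols: user-provided partial key entries can be injected by clamping the corresponding entries of the HMM emission matrix before (and after each M-step of) otherwise unsupervised EM training. The alphabets, function name, and clamping scheme below are illustrative assumptions, not the authors' implementation.

```python
# Sketch: inject partial ground truth into an HMM decipherer by clamping
# emission probabilities for user-supplied cipher-symbol -> letter pairs.
import numpy as np

def clamp_emissions(B, known, letters, symbols, eps=1e-6):
    """B[i, j] = P(symbol j | letter i). `known` maps cipher symbol -> letter."""
    B = B.copy()
    for sym, letter in known.items():
        i, j = letters.index(letter), symbols.index(sym)
        B[i, :] = eps            # this letter emits (almost) only this symbol
        B[:, j] = eps            # and the symbol is (almost) never emitted by others
        B[i, j] = 1.0
    return B / B.sum(axis=1, keepdims=True)   # re-normalize rows

letters = list("abc")            # toy plaintext alphabet
symbols = list("XYZ")            # toy cipher symbols
rng = np.random.default_rng(0)
B = rng.random((3, 3))
B /= B.sum(axis=1, keepdims=True)

# The user says cipher symbol "X" stands for plaintext "a": one piece of partial key.
B = clamp_emissions(B, {"X": "a"}, letters, symbols)
print(B.round(3))                # row for "a" now concentrates on column "X"
# In a full system, forward-backward/EM would update the free entries of B,
# with this clamping re-applied after every M-step.
```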
2 citations