Proceedings ArticleDOI

Improving Classical OCRs for Brahmic Scripts Using Script Grammar Learning

TL;DR: This paper proposes a machine learning-based scheme for generating Unicode from classifier symbols, which outperforms the existing generation scheme and improves accuracy for Devanagari and Bangla scripts.
Abstract: Classical OCRs based on isolated character (symbol) recognition were the fundamental way of generating textual representations, particularly for Indian scripts, until transcription-based approaches gained momentum. Though the former approaches have been criticized as prone to failures, their accuracy has nevertheless been fairly decent in comparison with the newer transcription-based approaches. Analysis of isolated character recognition OCRs for Hindi and Bangla revealed that most errors were generated in converting the output of the classifier to valid Unicode sequences, i.e., script grammar generation. Linguistic rules to generate scripts are inadequately integrated, resulting in a rigid Unicode generation scheme that is cumbersome to understand and error-prone when adapted to new Indian scripts. In this paper we propose a machine learning-based scheme for generating Unicode from classifier symbols, which outperforms the existing generation scheme and improves accuracy for Devanagari and Bangla scripts.
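To illustrate the conversion step the abstract refers to, here is a minimal sketch of a hand-written rule of the kind a rigid generation scheme relies on. It is not the paper's method, which learns this mapping instead. The symbol labels (`ka`, `i_matra`, and so on) and the function name are hypothetical; the only script fact assumed is that the Devanagari vowel sign I (U+093F) is drawn to the left of its consonant cluster but stored after it in Unicode.

```python
# Hypothetical classifier labels mapped to Devanagari code points.
SYMBOL_TO_UNICODE = {
    "ka": "\u0915",
    "ta": "\u0924",
    "i_matra": "\u093F",   # vowel sign I: drawn left of the cluster, stored after it
    "aa_matra": "\u093E",  # vowel sign AA: drawn and stored after the consonant
    "halant": "\u094D",    # virama, joins consonants into a conjunct
}
CONSONANTS = {"ka", "ta"}
PREBASE = {"i_matra"}


def symbols_to_unicode(symbols):
    """Turn visually ordered symbol labels into a logically ordered Unicode string."""
    out = []
    pending = None  # a pre-base vowel sign waiting for its consonant cluster to end
    for i, sym in enumerate(symbols):
        if sym in PREBASE:
            pending = sym
            continue
        out.append(SYMBOL_TO_UNICODE[sym])
        nxt = symbols[i + 1] if i + 1 < len(symbols) else None
        # The cluster ends at a consonant that is not followed by a halant.
        if pending is not None and sym in CONSONANTS and nxt != "halant":
            out.append(SYMBOL_TO_UNICODE[pending])
            pending = None
    if pending is not None:
        raise ValueError("pre-base vowel sign without a following consonant")
    return "".join(out)


print(symbols_to_unicode(["i_matra", "ka"]))
print(symbols_to_unicode(["i_matra", "ka", "halant", "ta"]))
```

Each additional script or glyph class needs more special cases of this sort, which is the rigidity the abstract describes.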
Citations
Proceedings ArticleDOI

[...]

01 Jun 2020
TL;DR: A Sanskrit-specific OCR system for printed classical Indic documents written in Sanskrit is developed, and an attention-based LSTM model for reading Sanskrit characters in line images is presented, setting the stage for application of OCRs on large corpora of classic Sanskrit texts containing arbitrarily long and highly conjoined words.
Abstract: OCR for printed classical Indic documents written in Sanskrit is a challenging research problem. It involves complexities such as image degradation, lack of datasets and long-length words. Due to these challenges, the word accuracy of available OCR systems, both academic and industrial, is not very high for such documents. To address these shortcomings, we develop a Sanskrit-specific OCR system. We present an attention-based LSTM model for reading Sanskrit characters in line images. We introduce a dataset of Sanskrit document images annotated at line level. To augment real data and enable high performance for our OCR, we also generate synthetic data via curated font selection and rendering designed to incorporate crucial glyph substitution rules. Consequently, our OCR achieves a word error rate of 15.97% and a character error rate of 3.71% on challenging Indic document texts and outperforms strong baselines. Overall, our contributions set the stage for application of OCRs on large corpora of classic Sanskrit texts containing arbitrarily long and highly conjoined words.

5 citations

References
Journal ArticleDOI

[...]

TL;DR: A review of the OCR work done on Indian language scripts and the scope of future work and further steps needed for Indian script OCR development is presented.
Abstract: Intensive research has been done on optical character recognition (OCR) and a large number of articles have been published on this topic during the last few decades. Many commercial OCR systems are now available in the market. But most of these systems work for Roman, Chinese, Japanese and Arabic characters. There is not a sufficient body of work on Indian language character recognition, although there are 12 major scripts in India. In this paper, we present a review of the OCR work done on Indian language scripts. The review is organized into 5 sections. Sections 1 and 2 cover the introduction and the properties of Indian scripts. In Section 3, we discuss different methodologies in OCR development as well as research work done on Indian script recognition. In Section 4, we discuss the scope of future work and further steps needed for Indian script OCR development. In Section 5 we conclude the paper.

565 citations

Proceedings ArticleDOI

[...]

18 Aug 1997
TL;DR: An OCR system is proposed that can read two Indian language scripts: Bangla and Devnagari (Hindi), the most popular ones in the Indian subcontinent, and shows a good performance for single font scripts printed on clear documents.
Abstract: An OCR system is proposed that can read two Indian language scripts: Bangla and Devnagari (Hindi), the most popular ones in the Indian subcontinent. These scripts, having the same origin in ancient Brahmi script, have many features in common and hence a single system can be modeled to recognize them. In the proposed model, document digitization, skew detection, text line segmentation and zone separation, word and character segmentation, character grouping into basic, modifier and compound character category are done for both scripts by the same set of algorithms. The feature sets and classification tree as well as the knowledge base required for error correction (such as lexicon) differ for Bangla and Devnagari. The system shows a good performance for single font scripts printed on clear documents.

192 citations

Journal ArticleDOI

[...]

TL;DR: A two-pass algorithm is presented for the segmentation and decomposition of Devanagari composite characters/symbols into their constituent symbols; a recognition rate of 85 percent is achieved on the segmented conjuncts.
Abstract: Devanagari script is a two-dimensional composition of symbols. It is highly cumbersome to treat each composite character as a separate atomic symbol because such combinations are very large in number. This paper presents a two-pass algorithm for the segmentation and decomposition of Devanagari composite characters/symbols into their constituent symbols. The proposed algorithm extensively uses structural properties of the script. In the first pass, words are segmented into easily separable characters/composite characters. Statistical information about the height and width of each separated box is used to hypothesize whether a character box is composite. In the second pass, the hypothesized composite characters are further segmented. A recognition rate of 85 percent has been achieved on the segmented conjuncts. The algorithm is designed to segment a pair of touching characters.

139 citations

Journal ArticleDOI

[...]

01 Nov 2002
TL;DR: A new technique is presented for identification and segmentation of touching characters based on fuzzy multifactorial analysis and a predictive algorithm is developed for effectively selecting possible cut columns for segmenting the touching characters.
Abstract: One of the important reasons for poor recognition rates in optical character recognition (OCR) systems is error in character segmentation. The existence of touching characters in scanned documents is a major obstacle to designing an effective character segmentation procedure. In this paper, a new technique is presented for identification and segmentation of touching characters. The technique is based on fuzzy multifactorial analysis. A predictive algorithm is developed for effectively selecting possible cut columns for segmenting the touching characters. The proposed method has been applied to printed documents in Devnagari and Bangla, the two most popular scripts of the Indian sub-continent. The results obtained from a test set of considerable size show that a reasonable improvement in recognition rate can be achieved with a modest increase in computation.

121 citations

Proceedings ArticleDOI

[...]

04 Sep 2005
TL;DR: It is shown that acoustic recognition scores of the recognized words in the lattices positively and significantly affect the translation quality and a fully integrated speech translation model is built.
Abstract: This paper focuses on the interface between speech recognition and machine translation in a speech translation system. Based on a thorough theoretical framework, we exploit word lattices of automatic speech recognition hypotheses as input to our translation system which is based on weighted finite-state transducers. We show that acoustic recognition scores of the recognized words in the lattices positively and significantly affect the translation quality. In experiments, we have found consistent improvements on three different corpora compared with translations of single best recognized results. In addition we build and evaluate a fully integrated speech translation model.

86 citations