Proceedings ArticleDOI

Improving Classical OCRs for Brahmic Scripts Using Script Grammar Learning

TL;DR: In this paper, a machine learning-based scheme for converting classifier symbols to Unicode is proposed; it outperforms the existing rule-based generation scheme and improves accuracy for Devanagari and Bangla scripts.
Abstract: Classical OCRs based on isolated character (symbol) recognition have been the fundamental way of generating textual representations, particularly for Indian scripts, until transcription-based approaches gained momentum. Though the former approaches have been criticized as prone to failures, their accuracy has nevertheless remained fairly competitive with the newer transcription-based approaches. Analysis of isolated character recognition OCRs for Hindi and Bangla revealed that most errors arose in converting the output of the classifier to valid Unicode sequences, i.e., script grammar generation. The linguistic rules governing script composition are only partially integrated, resulting in a rigid Unicode generation scheme that is cumbersome to understand and error-prone when adapted to new Indian scripts. In this paper we propose a machine learning-based scheme for converting classifier symbols to Unicode that outperforms the existing generation scheme and improves accuracy for Devanagari and Bangla scripts.
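To make the script-grammar problem concrete, here is a minimal sketch (not the authors' learned scheme) of one such rule in Python: an isolated-symbol classifier emits glyphs in visual order, but valid Unicode requires logical order. In Devanagari, the short-i matra (U+093F) is drawn before its consonant yet must be stored after it.

```python
# Minimal sketch of a hand-coded script-grammar rule, the kind of rule
# the paper proposes to learn rather than write by hand.
SHORT_I_MATRA = "\u093f"  # Devanagari vowel sign I, drawn pre-base

def visual_to_unicode(symbols):
    """Reorder a visual-order symbol sequence into Unicode logical order.

    Handles only the pre-base short-i matra; a real system needs many
    such interacting rules, which is why hand-coding them is brittle.
    """
    out = []
    pending_matra = None
    for sym in symbols:
        if sym == SHORT_I_MATRA:
            pending_matra = sym            # hold until the base consonant arrives
        else:
            out.append(sym)
            if pending_matra:
                out.append(pending_matra)  # matra follows consonant in Unicode
                pending_matra = None
    return "".join(out)

# For the word "हिन्दी" a left-to-right segmenter sees the matra first:
print(visual_to_unicode(["\u093f", "\u0939", "\u0928", "\u094d", "\u0926", "\u0940"]))
```

In a full system the matra attaches to an entire conjunct cluster, not a single consonant, so the reordering logic quickly becomes intricate; the paper's contribution is learning this mapping from data instead.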
Citations
Proceedings ArticleDOI
01 Jun 2020
TL;DR: A Sanskrit-specific OCR system for printed classical Indic documents is developed, featuring an attention-based LSTM model for reading Sanskrit characters in line images, setting the stage for applying OCR to large corpora of classical Sanskrit texts containing arbitrarily long and highly conjoined words.
Abstract: OCR for printed classical Indic documents written in Sanskrit is a challenging research problem. It involves complexities such as image degradation, lack of datasets, and long-length words. Due to these challenges, the word accuracy of available OCR systems, both academic and industrial, is not very high for such documents. To address these shortcomings, we develop a Sanskrit-specific OCR system. We present an attention-based LSTM model for reading Sanskrit characters in line images. We introduce a dataset of Sanskrit document images annotated at line level. To augment real data and enable high performance for our OCR, we also generate synthetic data via curated font selection and rendering designed to incorporate crucial glyph substitution rules. Consequently, our OCR achieves a word error rate of 15.97% and a character error rate of 3.71% on challenging Indic document texts and outperforms strong baselines. Overall, our contributions set the stage for application of OCRs on large corpora of classical Sanskrit texts containing arbitrarily long and highly conjoined words.

15 citations
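The word and character error rates reported above are standard edit-distance metrics. The following sketch shows how such rates are typically computed (this is the conventional Levenshtein formulation, not the authors' evaluation code; the example strings are illustrative):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insert/delete/substitute, cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def error_rate(ref_units, hyp_units):
    """Edit distance normalized by reference length (CER or WER)."""
    return levenshtein(ref_units, hyp_units) / max(len(ref_units), 1)

ref = "धर्मक्षेत्रे कुरुक्षेत्रे"
hyp = "धर्मक्षेत्रे कुरुक्षत्रे"  # one dropped matra
print("CER:", error_rate(list(ref), list(hyp)))    # character level
print("WER:", error_rate(ref.split(), hyp.split()))  # word level
```

A single dropped matra yields a small CER but a full word error, which is why WER runs much higher than CER for long, highly conjoined Sanskrit words.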

References
Proceedings ArticleDOI
01 Sep 2001
TL;DR: A complete OCR system for printed Hindi text in Devanagari script is presented, achieving 93% accuracy at the character level.
Abstract: In this paper, we present a complete OCR for printed Hindi text in Devanagari script. A character-level accuracy of 93% is obtained.

74 citations

Journal ArticleDOI
TL;DR: This paper describes a method for the correction of optically read Devanagari character strings using a Hindi word dictionary; it is the first work on the use of a Hindi word dictionary for OCR post-processing.
Abstract: This paper describes a method for the correction of optically read Devanagari character strings using a Hindi word dictionary. The word dictionary is partitioned in order to reduce the search space besides preventing forced matching to an incorrect word. The dictionary partitioning strategy takes into account the underlying OCR process. The dictionary words at the top level have been divided into two partitions, namely a short-words partition and a remaining-words partition. The short-words partition is sub-partitioned using the envelope information of the words. The envelope consists of the number of top, lower, and core modifiers along with the number of core characters. Devanagari characters are written in three strips; most of the characters, referred to as core characters, are written in the middle strip. The remaining words are further partitioned using tags. A tag is a string of fixed length associated with each partition. The correction process uses a distance matrix for assigning a penalty for a mismatch. The distance matrix is based on information about the errors that the classification process is known to make and the confidence figure that the classification process associates with its output. An improvement of approximately 20% in recognition performance is obtained. For a short word, 590 words are searched on average from 14 sub-partitions of the short-words partition before an exact match is found. The average number of partitions and the average number of words increase to 20 and 1585, respectively, when an exact match is not found. For tag-based partitions, on average 100 words from 30 partitions are compared when either an exact match or a word within the preset threshold distance is found. If an exact match or a match within a preset threshold is not found, the average number of partitions becomes 75 and 450 words on average are compared. To the best of our knowledge, this is the first work on the use of a Hindi word dictionary for OCR post-processing.

38 citations
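The envelope-based partitioning described above can be sketched as a simple keyed index: only the sub-partition whose envelope matches the query word is searched. The symbol classes below are illustrative placeholders (a real system derives them from which Devanagari strip each glyph occupies), and the sketch folds core modifiers into the core count for brevity:

```python
from collections import defaultdict

# Illustrative symbol classes; not the paper's actual feature extraction.
TOP_MODIFIERS = set("\u0947\u0948\u0945\u0901")  # e.g. े ै ॅ ँ (upper strip)
LOWER_MODIFIERS = set("\u0941\u0942\u094d")      # e.g. ु ू ् (lower strip)

def envelope(word):
    """Key = (#top modifiers, #lower modifiers, #core characters)."""
    top = sum(c in TOP_MODIFIERS for c in word)
    low = sum(c in LOWER_MODIFIERS for c in word)
    return (top, low, len(word) - top - low)

def build_partitions(dictionary):
    """Group dictionary words into sub-partitions by envelope key."""
    parts = defaultdict(list)
    for w in dictionary:
        parts[envelope(w)].append(w)
    return parts

parts = build_partitions(["मेरा", "पुस्तक", "घर", "नहीं"])
query = "घर"
# Only the sub-partition sharing the query's envelope is searched.
print(parts[envelope(query)])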

Proceedings ArticleDOI
23 Sep 2007
TL;DR: The method is model-driven and is intended to annotate a large collection of documents, scanned at three different resolutions, at the character level; an XML representation is employed for storage of the annotation information.
Abstract: A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creating annotated corpora is tedious and laborious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotating a large collection of printed document images. We align document images with independently keyed-in text. The method is model-driven and is intended to annotate a large collection of documents, scanned at three different resolutions, at the character level. We employ an XML representation for storage of the annotation information. APIs are provided for content-level access, for easy use in training and evaluating OCRs and in other document understanding tasks.

33 citations
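The paper does not spell out its XML schema here, so the sketch below is a hypothetical minimal layout of what character-level annotation storage can look like: element and attribute names (page, line, char, bbox) are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal schema, not the corpus's actual format: each
# character links its Unicode value to a bounding box in a given scan.
page = ET.Element("page", image="doc_001_300dpi.png", resolution="300")
line = ET.SubElement(page, "line", id="1")
for ch, box in [("ह", "12,40,34,78"), ("ि", "0,40,12,78")]:
    ET.SubElement(line, "char", text=ch, bbox=box)

ET.indent(page)  # pretty-printing; available in Python 3.9+
print(ET.tostring(page, encoding="unicode"))
```

An access API would then walk such trees to hand training code (image crop, label) pairs without exposing the storage layout.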

Proceedings ArticleDOI
20 Sep 1999
TL;DR: This paper describes a correction method for optically read Devanagari character strings which uses a partitioned word dictionary; an improvement of approximately 20% in recognition performance is obtained.
Abstract: This paper describes a correction method for optically read Devanagari character strings which uses a partitioned word dictionary. The word dictionary is partitioned in order to reduce the search space besides preventing a forced match to an incorrect word. The envelope information of words, consisting of the number of top, lower, and core modifiers along with the number of core characters, forms the second-level partitioning feature for short-word partitions. The remaining words are further partitioned using tags. A tag is a string of fixed length associated with each partition. The search process uses a distance matrix for assigning a penalty for a mismatch. An improvement of approximately 20% in recognition performance is obtained.

29 citations
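Both of the dictionary-correction papers above score candidates with a distance matrix that penalizes mismatches according to the classifier's known confusions. A hedged sketch of that scoring idea is a weighted edit distance; the confusion pairs and costs below are invented for illustration, not taken from the papers:

```python
# Substitutions the OCR is known to confuse are cheap; others cost 1.
CONFUSION_COST = {("ब", "व"): 0.2, ("घ", "ध"): 0.3}  # illustrative values

def sub_cost(a, b):
    if a == b:
        return 0.0
    return CONFUSION_COST.get((a, b), CONFUSION_COST.get((b, a), 1.0))

def weighted_distance(ocr_word, dict_word):
    """Edit distance with confusion-aware substitution penalties."""
    m, n = len(ocr_word), len(dict_word)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1,   # insertion
                          d[i - 1][j - 1] + sub_cost(ocr_word[i - 1], dict_word[j - 1]))
    return d[m][n]

# The plausible confusion "वन" -> "बन" scores far below an unrelated word.
print(weighted_distance("वन", "बन"), weighted_distance("वन", "घर"))
```

Within each envelope or tag partition, the dictionary word with the lowest weighted distance (under a preset threshold) is accepted as the correction.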

Proceedings Article
27 Jul 2014
TL;DR: This paper proposes the first formal framework for scripts based on Hidden Markov Models (HMMs) and develops an algorithm for structure and parameter learning based on Expectation Maximization, which is superior to several informed baselines for predicting missing events in partial observation sequences.
Abstract: Scripts have been proposed to model the stereotypical event sequences found in narratives. They can be applied to make a variety of inferences including filling gaps in the narratives and resolving ambiguous references. This paper proposes the first formal framework for scripts based on Hidden Markov Models (HMMs). Our framework supports robust inference and learning algorithms, which are lacking in previous clustering models. We develop an algorithm for structure and parameter learning based on Expectation Maximization and evaluate it on a number of natural datasets. The results show that our algorithm is superior to several informed baselines for predicting missing events in partial observation sequences.

29 citations
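The inference core that Expectation Maximization builds on in this framework is standard HMM likelihood computation. Below is a minimal sketch of the forward algorithm used to rank candidates for a missing event; the states, events, and probabilities are toy values invented for illustration:

```python
import numpy as np

# Toy script HMM: hidden "scene" states emit observable narrative events.
states = ["enter", "order", "eat", "pay"]
events = ["sit_down", "ask_menu", "bring_food", "give_money"]
pi = np.array([0.9, 0.1, 0.0, 0.0])                 # initial distribution
A = np.array([[0.1, 0.8, 0.1, 0.0],                 # state transitions
              [0.0, 0.2, 0.7, 0.1],
              [0.0, 0.0, 0.3, 0.7],
              [0.0, 0.0, 0.0, 1.0]])
B = np.array([[0.7, 0.3, 0.0, 0.0],                 # event emissions
              [0.1, 0.6, 0.3, 0.0],
              [0.0, 0.1, 0.8, 0.1],
              [0.0, 0.0, 0.1, 0.9]])

def forward(obs):
    """P(observation sequence) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Partial narrative with the third event missing: score each candidate
# by the likelihood of the completed sequence.
seq = [events.index(e) for e in ["sit_down", "ask_menu", "give_money"]]
for cand in range(len(events)):
    full = seq[:2] + [cand] + seq[2:]
    print(events[cand], forward(full))
```

EM alternates between such probabilistic inference over the hidden states and re-estimating pi, A, and B, which is what gives the framework the robust learning the clustering-based script models lacked.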