Generalization of Hindi OCR Using Adaptive Segmentation and Font Files

doi:10.1007/978-1-84800-330-9_10

Book Chapter

Mudit Agrawal, +2 more

- pp 181-207

TLDR

The chapter describes an adaptive Indic OCR system, implemented as part of a rapidly retargetable language tool effort, as a step toward recognizing the scripts of low-density languages, which typically do not warrant the development of commercial OCR yet often have complete TrueType font descriptions.

Abstract:

In this chapter, we describe an adaptive Indic OCR system implemented as part of a rapidly retargetable language tool effort and extend the work found in [20, 2]. The system includes script identification, character segmentation, training sample creation, and character recognition. For script identification, Hindi words are identified in bilingual or multilingual document images using features of the Devanagari script and a support vector machine (SVM). Identified words are then segmented into individual characters using a font-model-based intelligent character segmentation and recognition system. Using characteristics of structurally similar TrueType fonts, our system automatically builds a model to be used for the segmentation and recognition of the new script, independent of glyph composition. The key is a reliance on known font attributes. In our recognition system, three feature extraction methods are used to demonstrate the importance of appropriate features for classification. The methods are tested on both Latin and non-Latin scripts. Results show that character-level recognition accuracy exceeds 92% for non-Latin text and 96% for Latin text on degraded documents. This work is a step toward the recognition of scripts of low-density languages, which typically do not warrant the development of commercial OCR yet often have complete TrueType font descriptions.
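
The abstract does not say which Devanagari features feed the SVM. As a rough illustration only, the sketch below computes one commonly cited cue: the headline (shirorekha), a near-continuous horizontal stroke across the top of a Hindi word, which shows up as a sharp peak in the row-wise ink profile. The function names, the toy bitmaps, and the choice of this feature are assumptions made for illustration and are not taken from the chapter; in a real system such values would be only part of a feature vector passed to a classifier.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # binary word image: 1 = ink, 0 = background


def row_profile(word: Bitmap) -> List[int]:
    """Count ink pixels in each row of a binary word image."""
    return [sum(row) for row in word]


def headline_features(word: Bitmap) -> Tuple[float, float]:
    """Return (peak_ratio, peak_position) for a binary word image.

    peak_ratio: ink count of the densest row divided by the image width.
    peak_position: index of that row divided by the image height (0 = top).
    Hypothetical feature pair; not the feature set used in the chapter.
    """
    profile = row_profile(word)
    height, width = len(word), len(word[0])
    # max() returns the first densest row when several rows tie.
    peak_row = max(range(height), key=lambda r: profile[r])
    return profile[peak_row] / width, peak_row / height


# Toy word with a full-width stroke near the top (Devanagari-like).
devanagari_like: Bitmap = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

# Toy word with no dominant horizontal stroke (Latin-like).
latin_like: Bitmap = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
```

On these toy bitmaps the Devanagari-like word yields a peak ratio of 1.0 in the upper part of the image, while the Latin-like word peaks at 0.5. Real scans are noisy and skewed, so a threshold on this value alone would be fragile, which is one reason to learn the decision boundary with a classifier such as an SVM.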

Citations

Offline Recognition of Devanagari Script: A Survey

Detecting and correcting skew in regions of text in natural images

Identifying a maximally stable extremal region (MSER) in an image by skipping comparison of pixels in the region

Feature extraction and use with a probability density function (PDF) divergence metric

Hindi Text Document Classification System Using SVM and Fuzzy: A Survey

References

Online and off-line handwriting recognition: a comprehensive survey

Image analysis via the general theory of moments

Invariant image recognition by Zernike moments

A survey of methods and strategies in character segmentation

The document spectrum for page layout analysis
