Book Chapter

Generalization of Hindi OCR Using Adaptive Segmentation and Font Files

TLDR
An adaptive Indic OCR system, implemented as part of a rapidly retargetable language tool effort, that is a step toward the recognition of scripts of low-density languages, which typically do not warrant the development of commercial OCR yet often have complete TrueType font descriptions.
Abstract
In this chapter, we describe an adaptive Indic OCR system implemented as part of a rapidly retargetable language tool effort and extend the work found in [20, 2]. The system includes script identification, character segmentation, training sample creation, and character recognition. For script identification, Hindi words are identified in bilingual or multilingual document images using features of the Devanagari script and a support vector machine (SVM). Identified words are then segmented into individual characters using a font-model-based intelligent character segmentation and recognition system. Using characteristics of structurally similar TrueType fonts, our system automatically builds a model to be used for the segmentation and recognition of the new script, independent of glyph composition. The key is a reliance on known font attributes. In our recognition system, three feature extraction methods are used to demonstrate the importance of appropriate features for classification. The methods are tested on both Latin and non-Latin scripts. Results show that character-level recognition accuracy exceeds 92% for non-Latin text and 96% for Latin text on degraded documents. This work is a step toward the recognition of scripts of low-density languages, which typically do not warrant the development of commercial OCR yet often have complete TrueType font descriptions.
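The script identification step can be illustrated with a minimal sketch. The abstract does not say which Devanagari features are used, so the example below assumes one well-known characteristic of the script, the horizontal headline (shirorekha) that joins the characters of a word. It also swaps the SVM for a fixed threshold to stay dependency-free. The names `headline_strength` and `looks_like_devanagari`, the threshold value, and the toy bitmaps are all hypothetical and are not taken from the chapter.

```python
from typing import List

# A word image as a binary bitmap: 1 = ink, 0 = background.
Bitmap = List[List[int]]


def row_ink_fractions(word: Bitmap) -> List[float]:
    """Fraction of ink pixels in each row (a horizontal projection profile)."""
    width = len(word[0])
    return [sum(row) / width for row in word]


def headline_strength(word: Bitmap) -> float:
    """Largest ink fraction among the rows in the upper half of the image.

    Assumption: a Devanagari word shows a near-continuous headline there,
    so this value approaches 1.0; Latin words rarely have such a row.
    """
    fractions = row_ink_fractions(word)
    upper_half = fractions[: max(1, len(fractions) // 2)]
    return max(upper_half)


def looks_like_devanagari(word: Bitmap, threshold: float = 0.8) -> bool:
    """Hypothetical stand-in for the SVM: threshold a single feature."""
    return headline_strength(word) >= threshold


# Toy 6x8 word images (illustrative only, not real glyphs).
devanagari_like = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],  # continuous headline
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
latin_like = [
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

if __name__ == "__main__":
    print(looks_like_devanagari(devanagari_like))
    print(looks_like_devanagari(latin_like))
```

In a fuller system, a feature like this would be one component of a vector passed to a trained classifier rather than thresholded directly.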


Citations
Journal Article

Offline Recognition of Devanagari Script: A Survey

TL;DR: This paper surveys the state of the art, since the 1970s, in optical character recognition (OCR) of machine-printed and handwritten Devanagari.
Patent

Detecting and correcting skew in regions of text in natural images

TL;DR: An electronic device uses a camera to capture an image of the environment outside the device and then identifies regions based on pixel intensities in the image; the process may then repeat (e.g., capture image, detect skew, and if necessary request user input).
Patent

Identifying a maximally stable extremal region (MSER) in an image by skipping comparison of pixels in the region

TL;DR: A difference in intensities of a pair of pixels in an image is repeatedly compared to a threshold, with the pair of pixels separated by at least one skipped pixel. When the threshold is found to be exceeded, a selected position of a selected pixel in the pair and at least one additional position adjacent to the selected position are added to a set of positions.
Patent

Feature extraction and use with a probability density function (PDF) divergence metric

TL;DR: Each block is subdivided into sub-blocks, and each sub-block is traversed to obtain counts, in a group for each sub-block, in order to identify blocks as candidates to be recognized.
Journal Article

Hindi Text Document Classification System Using SVM and Fuzzy: A Survey

TL;DR: A proposed classification system for printed and handwritten Hindi documents, using a support vector machine and fuzzy logic, first pre-processes text document images and then classifies them into predefined categories.
References
Journal Article

Online and off-line handwriting recognition: a comprehensive survey

TL;DR: The nature of handwritten language, how it is transduced into electronic data, and the basic concepts behind written language recognition algorithms are described.
Journal Article

Image analysis via the general theory of moments

TL;DR: Two-dimensional image moments with respect to Zernike polynomials are defined, and it is shown how to construct an arbitrarily large number of independent, algebraic combinations of Zernike moments that are invariant to image translation, orientation, and size.
Journal Article

Invariant image recognition by Zernike moments

TL;DR: A systematic reconstruction-based method for deciding the highest-order Zernike moments required in a classification problem is developed, and the superiority of Zernike moment features over regular moments and moment invariants is experimentally verified.
Journal Article

A survey of methods and strategies in character segmentation

TL;DR: Holistic approaches that avoid segmentation by recognizing entire character strings as units are described, including methods that partition the input image into subimages, which are then classified.
Journal Article

The document spectrum for page layout analysis

TL;DR: The document spectrum (or docstrum) is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components, which yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks.